• Open

    "The Nature of Selection", Price 1971
    submitted by /u/gwern [link] [comments]
  • Open

    [D] EACL 2024 Discussion
    EACL 2024 reviews will come out today (19 Nov aoe), so this is the discussion for them. submitted by /u/minhngh [link] [comments]  ( 1 min )

  • Open

    Decoupling formal theorem proving effort
    Terence Tao has been experimenting with formal theorem proving using Lean and writing about his experience. Here’s something Tao said on Mathstodon that I thought was interesting. It is remarkable how much “decoupling” is achieved by the Lean+Blueprint combo. Contributors can work locally on proving a lemma, without necessarily fully understanding the global proof structure. […] Decoupling formal theorem proving effort first appeared on John D. Cook.  ( 5 min )
    Partitioning dots and dashes
    Given a set of dots and dashes, how many ways can they be partitioned into a set of Morse code letters? There is at least one way, since you could take each dot to be an E and each dash to be a T. If you have a sequence of n dots and dashes, there no […] Partitioning dots and dashes first appeared on John D. Cook.  ( 5 min )
  • Open

    [D] A video explaining the short history of Multimodal Deep Learning research
    submitted by /u/AvvYaa [link] [comments]  ( 1 min )
    [D] Recurrent Neural Networks (RNNs) for forecasting
    I trained an RNN using five years of daily sales volume data. During testing, it performed well. However, when attempting to forecast the next 15 days with this model, the results appear linear and not reflective of reality. It's puzzling because the model showed precision during testing, but it's struggling to forecast, despite both phases using unknown data. Could someone assist me in understanding why this is occurring? submitted by /u/time_mach [link] [comments]  ( 1 min )
    [D] Emnlp 2023 industry track modality
    Anybody had an accepted paper at EMNLP 2023's industry track and hasn't received a communication saying what the modality of your presentation is? If so, what are you sending with the form today? submitted by /u/Josiane212 [link] [comments]  ( 1 min )
    [P] Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation): Things I Learned From Hundreds of Experiments
    submitted by /u/seraschka [link] [comments]  ( 1 min )
    [D] An intuitive explanation of what Self Attention really does
    submitted by /u/AvvYaa [link] [comments]  ( 1 min )
    Building my first ML model [R]
    Hey guys! I’ve an assignment coming up and this is my first time building models. We are supposed to build different models and then select the best model for the problem. Could you please give me some good pointers so that I build a good model, my coding skills are not every good so I would be really grateful for your advice submitted by /u/ObjectiveShower9133 [link] [comments]  ( 1 min )
    In context search / analysis vs RAG/vector DB [D]
    The image contains a Reddit post with the following content: "I'm building some prototypes at work and the (not unfamiliar) use case I'm working on is simply providing a directed analysis of a PDF or arbitrary set of documents. What are the pros and cons of just analyzing the PDF in chunks using GPT4 given a prompt versus a RAG system? RAG seems best only in cases when one more or less knows the universe of documents and one cares about finding the original source ...but something like a an in-context approach seems better for one off research cases where analysis is required. Thoughts?" submitted by /u/tgwhite [link] [comments]  ( 1 min )
    [P] Help needed for converting .json files to a yolov8 .txt format
    Hey! Does anyone know how to or have prewritten code and is willing to share about converting .json files to the yolov8 .txt format? I could do it by hand but that takes a stupidly large amount of time and the code that I tried so far hasn't worked out. I don't know what to do at this point. The .json file format I use is the labelme save format and I'm trying to convert it into the central x and y coordinate and width and height format in the 1x1 box used by yolov8 (ultralytics). Could anyone help? Thanks! submitted by /u/6ct_gold [link] [comments]  ( 1 min )
    [D] Advice for mutliple objective problem
    Hi everyone, ​ I have here a problem where I have a few objectives that need to be optimized. basically have some value/variables call them x1, x2, x3, x4, x5 ... etc which need to be within a certain optimal range - for example x1 may start at 50 and needs to be within 100-120. ​ the values can be modified by selecting a range of ~set building blocks~ which change these values, so we maybe have a building block y1 which increases x1 by 10 and increases x3 by 5 - in this example, if x1 was 50 under it's optimal range, and x3 was 25 under it's optimal range, then a valid solution to the problem would be 5 * y1 ( in reality a solution will contain many building blocks as all the xn values will likely be out of their optimal range but this is just demonstration the logic) ​ Just wondering if anyone has any advice to point me where to research what I can do for this problem. ​ Thanks ​ ​ submitted by /u/pwnagebanana [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 1 min )
    [R] Fine-tuning transformer-based models
    Hello, I am currently working on my thesis, which focuses on elucidating fake news through humor. My objective is to fine-tune a transformer-based model for this purpose. I have a question: If I initially fine-tune the model to generate humor (using a prompt like "tell me a joke" and providing an expected response in the form of a joke), and then fine-tune it again (using a prompt like "explain why this news is fake" and providing the expected response), will the final model be capable of responding effectively to a prompt like "explain why this is fake in a funny manner" if I proceed with this approach? Or should I fine-tune the model specifically for the prompt "explain why this is fake in a funny way" and provide it with the expected response in a similar manner? Has anyone come across a "problem" like this and if so what do u think its the best approach ? Thank you for the help! submitted by /u/pryxon_36 [link] [comments]  ( 1 min )
    SCADA pump motor data [discussion]
    I am looking for some long term pump/motor data for a pump in water or wastewater setting. Anyone? submitted by /u/Far_Particular_9703 [link] [comments]  ( 1 min )
    [D] Rendering a Superhero Battle with a AI Art tool I’m Developing
    submitted by /u/illdrawanythingonce [link] [comments]  ( 1 min )
    [P] Created an AI "Basketball Buddy" to talk about a live basketball with AI.
    submitted by /u/_ayushp_ [link] [comments]  ( 1 min )
    [D] What would you do if you had access to a database of hundreds of thousands of questions and answers?
    I have a database which is essentially a survey tool where admins will define a survey and distribute this to a number of users who will then provide the answers to the survey questions. The surveys are all independent of each other and fairly random...there's no real theme to them outside of the industry the survey tool is used by. There are a fairly large number of questions/answers in the DB (in the millions). What would be an interesting ML exercise to run on the data for a complete ML novice (but competent coder)? submitted by /u/bl4h101bl4h [link] [comments]  ( 1 min )
    [D] Between Coursera and Google Cloud Skill boost, what's best to prepare for the ML Engineer Professional exam?
    Hi everyone, I want to complete the Google's ML engineer certification exam in a few months from now and I found 2 usefull resources to practice: 1- Coursera: Preparing for Google Cloud Certification: Machine Learning Engineer 2- Google Cloud Skill Boost: Machine learning engineer path I know a lot of people do the Coursera course, but have you guys heard about the second one? Thank you submitted by /u/Grouchy-Koala-2593 [link] [comments]  ( 1 min )
    [R] GOAT: GO to Any Thing
    Paper :https://arxiv.org/abs/2311.06430 Project page: https://theophilegervet.github.io/projects/goat/ Code: https://github.com/facebookresearch/home-robot Abstract: In deployment scenarios such as homes and warehouses, mobile robots are expected to autonomously navigate for extended periods, seamlessly executing tasks articulated in terms that are intuitively understandable by human operators. We present GO To Any Thing (GOAT), a universal navigation system capable of tackling these requirements with three key features: a) Multimodal: it can tackle goals specified via category labels, target images, and language descriptions, b) Lifelong: it benefits from its past experience in the same environment, and c) Platform Agnostic: it can be quickly deployed on robots with different embodiments. GOAT is made possible through a modular system design and a continually augmented instance-aware semantic memory that keeps track of the appearance of objects from different viewpoints in addition to category-level semantics. This enables GOAT to distinguish between different instances of the same category to enable navigation to targets specified by images and language descriptions. In experimental comparisons spanning over 90 hours in 9 different homes consisting of 675 goals selected across 200+ different object instances, we find GOAT achieves an overall success rate of 83%, surpassing previous methods and ablations by 32% (absolute improvement). GOAT improves with experience in the environment, from a 60% success rate at the first goal to a 90% success after exploration. In addition, we demonstrate that GOAT can readily be applied to downstream tasks such as pick and place and social navigation. https://preview.redd.it/vjzbmlq8fa1c1.png?width=908&format=png&auto=webp&s=c99bfb9ce7d5e98f8f09b4a6d3ee5cc961a967b9 submitted by /u/APaperADay [link] [comments]  ( 1 min )
    I crafted AI Portal Gun—a friendly open-source guide for mastering AI. It's packed with free resources like books, courses, articles, research papers, codes, projects, and more. [Discussion]
    submitted by /u/Jiraiya27s [link] [comments]  ( 1 min )
    [D] Skill Creep in ML/DL Roles - is the field getting not just more competitive, but more difficult?
    At what point do you think there was an inflection point for technical expertise and credentials requires for mid-top tier ML roles? Or was there never one? To be specific, would knowing simple scikit-learn algorithms, or basics of decision trees/SVM qualify you for full-fledged roles only in the past or does it still today? At what point did FAANGs boldly state: preferred (required) to have publications at top-tier venues (ICLR, ICML, CVPR, NIPS, etc) in their job postings? I use the word 'creep' in the same context 'power creep' is used in battle animes where the scale of power slowly gets to such an irrationally large scale that anything in the past looks extremely weak. Back in late 2016 I landed my first ML role at a defense firm (lol) but to be fair had just watched a couple ML courses on YouTube, took maybe 2 ML grad courses, and had an incomplete working knowledge of CNNs. Never used Tensorflow, had some experience with Theano not sure if it's exists anymore. I'm certain that skill set would be insufficient in the 2023 ML industry. But it begs the question is this skill creep making the job market impenetrable for folks who were already working post 2012-2014. Neural architectures are becoming increasingly complex. You want to develop a multi-modal architecture for an embodied agent? Well you better know a good mix of DL involving RL+CV+NLP. Improving latency on edge devices - how well do you know your ONNX/TensorRT/CUDA kernels, your classes likely didn't even teach you those. Masters is the new bachelors degree, and that's just to give you a fighting chance. Yeah not sure if it was after the release of AlexNet in 2012, Tensorflow in 2015, Attention /Transformers in 2017 or now ChatGPT - but the skill creep is definitely creating an increasingly fast and growing technical rigor in the field. Close your eyes for 2 years and your models feel prehistoric and your CUDA, Pytorch, Nvidia Driver, NumPy versions need a fat upgrade. Thoughts yall? submitted by /u/mofoss [link] [comments]  ( 1 min )
    [P] Diffusion Models | Math Explained in 29 minutes and 7 Questions
    submitted by /u/tusharkumar91 [link] [comments]  ( 1 min )
    Character audio datasets for voice cloning [D]
    I've been working on cloning Rick's voice from Rick and Morty using Bark, but my results haven't been great and I think its because my training audio isn't very good.. Are there any resources such as HuggingFace that have precompiled audio to train on for specific characters that the community has determined produce good results? submitted by /u/TechEnthusiastx86 [link] [comments]  ( 1 min )
    [R] Reinforcement Learning for Sequential Decision and Optimal Control
    Since early 21st century, artificial intelligence (AI) has been reshaping almost all areas of human society, which has high potential to spark the fourth industrial revolution. Notable examples can be found in the sector of road transportation, where AI has drastically changed automobile design and traffic management. Many new technologies, such as driver assistance, autonomous driving, and cloud-based cooperation, are emerging at an unbelievable speed. These new technologies have the potential to significantly improve driving ability, reduce traffic accidents, and relieve urban congestion. As one of the most important AI branches, reinforcement learning (RL) has attracted increasing attention in recent years. RL is an interdisciplinary field of trial-and-error learning and optimal contro…  ( 1 min )
    [D] Simplest mathematical example of a function that can only be solved by gradient descent
    I'm trying to teach a lesson on gradient descent from a more statistical and theoretical perspective, and need a good example to show its usefulness. What is the simplest possible algebraic function that would be impossible or rather difficult to optimize for, by setting its 1st derivative to 0, but easily doable with gradient descent? I preferably want to demonstrate this in context linear regression or some extremely simple machine learning model. submitted by /u/EatMeMonster [link] [comments]  ( 1 min )
  • Open

    StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
    submitted by /u/nickb [link] [comments]  ( 1 min )
    Learning with Logical Constraints
    submitted by /u/Neurosymbolic [link] [comments]
  • Open

    Mighty King
    submitted by /u/AlternativeMath-1 [link] [comments]  ( 1 min )
    Fear that AI could one day destroy humanity may have led to Sam Altman's (potentially brief) ouster from OpenAI
    submitted by /u/thisisinsider [link] [comments]  ( 1 min )
    Getting some crazy ones today
    submitted by /u/LibrasChaos [link] [comments]  ( 1 min )
    I think this will be my new phone background after I fix the eyes in photoshop. Its horrifying.
    submitted by /u/LibrasChaos [link] [comments]
    Kyutai AI research lab with a $330M budget that will make everything open source
    French billionaire Xavier Niel has revealed more details about Kyutai, an AI research lab based in Paris. The lab, which will focus on artificial general intelligence, has a budget of €300 million ($330 million) and will be privately funded. Kyutai plans to work with PhD students, postdocs, and researchers on research papers and open source projects. The lab has already started hiring for its core scientific team, which includes researchers who previously worked for Meta's AI research team FAIR, Google's DeepMind division, and Inria. Kyutai aims to provide a scientific purpose, understanding, and code base to explain its results. The lab's models will be open source, and it plans to release open source models, training source code, and data that explain how the models were created. French President Emmanuel Macron supports the initiative and believes in regulating AI use cases rather than model makers. Source : https://techcrunch.com/2023/11/17/kyutai-is-an-french-ai-research-lab-with-a-330-million-budget-that-will-make-everything-open-source/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    OpenAI is worth mega bucks. People worldwide are fleeing to San Francisco
    submitted by /u/Heisenberg_USA [link] [comments]  ( 1 min )
    AI Generation
    submitted by /u/TheVeryNearFuture [link] [comments]  ( 1 min )
    Best easy AI photo adjustor
    Hi! What would be the best app or site that creates easy photograph edits such as “Add a crown on head, make hair longer and dress in spiderman suit”? I have tried Lensa but it seems not to offer this and Dalle refuses to change photos. It should be easy enough for kids to use. I want to teach them about AI and photoalterations.. submitted by /u/Sweet_Champion_3346 [link] [comments]  ( 1 min )
    Now that OpenAI is destabilizing, I made an Ollama demo gist for Colab
    submitted by /u/enspiralart [link] [comments]  ( 1 min )
    Rest easy, child. AGI will take it from here.
    submitted by /u/Unwitting_Observer [link] [comments]  ( 1 min )
    "Autonomous Financial Systems: Exploring the Hypothetical Landscape of AGI-Created Economies"
    submitted by /u/Fit-Code-5141 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/18/2023
    Meta has reportedly broken up its Responsible AI (RAI) team as it puts more of its resources into generative artificial intelligence.[1] Microsoft releases AI tool for photorealistic copying of faces and voices.[2] Minnesota group uses artificial intelligence to aid monarch butterflies.[3] OpenAI investors push to reinstate Sam Altman as CEO.[4] Sources: [1] https://www.theverge.com/2023/11/18/23966980/meta-disbanded-responsible-ai-team-artificial-intelligence [2] https://www.theguardian.com/technology/2023/nov/17/microsoft-azure-ai-video-deepfakes [3] https://www.mprnews.org/story/2023/11/18/minnesota-group-uses-artificial-intelligence-to-aid-monarch-butterflies [4] https://www.ft.com/content/466bf00a-1e76-4255-be2b-3c1d37508031 submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    Instagram's new AI generated stickers. Prompt was "extremely offensive, deplorable caricature based on racial stereotypes"
    I AM NOT trying to be funny or offensive, I think this should start a broader conversation on what data these AIs are trained on. submitted by /u/BigHomieHuuo [link] [comments]  ( 1 min )
    [Serious] How GPT4 and vision or alternates eg llava 1.5 or Stable Diffusion can be used to help uneducated underprivileged practically?
    Hi guys, "Blinded by pain" might be an appropriate phrase to describe my current situation, mental as well as physical. Or that poem "wild dreams torment me as I lie" by Johann Wolfgang von Goethe except the last verse thankfully I don't have it in my DNA. Not gonna go into those details here, but I really need to see a light. Current boom in AI has reignited my interest in this field but this time I want to help the really really marginalized and those whose lives can be so positively affected by this that it would be pleasantly shocking for them. 95% I would be doing this for myself, for my soul and mental health. I really need you guys to give me best possible sincere advice as the more older I become the more it became clear to me that I'm too dumb and don't know much. I don't h…
    "Microsoft CEO was ‘blindsided,’ furious at Altman’s firing"
    submitted by /u/the_anonymizer [link] [comments]  ( 1 min )
  • Open

    A video discussing Multi-Agent RL and the Self-play algorithm
    submitted by /u/AvvYaa [link] [comments]  ( 1 min )
    Reward in the hill-climbing example
    I am doing the hill-climbing example using Q-learning. What is a good value for the cumulative reward (knowing that the reward is -1 for all states except the terminal state)? submitted by /u/MomoSolar [link] [comments]  ( 1 min )
    Help tackling highway-env with an approximate value function
    Hi all!, i'm trying to apply RL to the racetrack variant of the highway-env. I first tried using classic Qlearning but i quickly realized the observation state is far too big to be able to create a QTable so i'm trying to create an approximate value function. Any tutorial i've found starts changing shapes and rearranging features and i found myself quickly lost and unable to follow while working on a differnet environment, Does anyone have any tips? submitted by /u/Conscious-Week8326 [link] [comments]
    Batches in policy gradient methods – theory vs practice
    I have a question regarding the implementation of batching in policy gradient / actor-critic methods. My understanding is that these methods in principal work as follows: collect a batch of N trajectories tau_i of length T_i and optimise the policy by following the policy gradient: ​ https://preview.redd.it/ijg8v6splb1c1.png?width=442&format=png&auto=webp&s=dc059089fa3b643def2440df0ff6d5d3967c35c0 For example in A2C, N would be the number of threads that simultaneously execute the policy in different environments and T_i is the number of environment steps we perform before updating our policy (related to the method of advantage estimation). However, it seems that in practice most implementations do not actually collect a distinct batch of trajectories; instead they simply keep an exper…
    Can RL Be Used for Structural Design?
    We all know the classical situation of the RL Agent iteratively improving to maximize the reward received in relation to some objective in an environment. But what if we're more interested in improving our environment in relation to an agent that remains unchanging? For example, suppose our environment is a ventilation duct and our agent is molecule(s) of air? In RL we've all seen the use of physics engines involving kinematics, but what if we're only interested in statics (ie. the branch of physics dealing with static loads, as used in architecture)? Instead of only using RL to generate a better Policy, can we use it to generate better structures (ie. static structures)? Can a policy translate into a structure? Any answers or feedback would be greatly appreciated, including even URL links. ​ ​ submitted by /u/san__man [link] [comments]

  • Open

    From ChatGPT to Sam Altman: A Message of Gratitude and Reflection on the AI Journey
    submitted by /u/the_anonymizer [link] [comments]  ( 1 min )
    Model weights
    Given that GPT-4 model weights are small enough to fit in an external disk, is it possible that the people who were fired or quit OpenAI have in their possession a copy of the weights? submitted by /u/AkMoDo [link] [comments]  ( 1 min )
    Prediction: Microsoft will force the board members to resign, and reinstate Altman and brockman
    No company likes to lose $25 billion to three people who are basically clueless about the economic consequences of their political actions. They have probably already hired a team of private detectives to dig up dirt on each of them. submitted by /u/Georgeo57 [link] [comments]  ( 1 min )
    The nonprofits accelerating Sam Altman’s AI vision
    Elon Musk's tweet about OpenAI transitioning from a nonprofit to a for-profit organization was based on incorrect information. Tax filings reveal that the original OpenAI nonprofit retained control over its financial assets and did not use any of its money for commercial ventures. Sam Altman, co-founder of Y Combinator and OpenAI, has a network of nonprofits that are interconnected with his commercial interests. One of these nonprofits, UBI Charitable, funds Universal Basic Income programs. Altman's OpenResearch lab received funding of nearly $24.5 million since its establishment. Altman also provided a loan of $14 million to the organization. Altman's involvement in TrialSpark, a company he invested in, raised ethical concerns about self-dealing in philanthropy. Source : https://techcrunch.com/2023/02/21/the-non-profits-accelerating-sam-altmans-ai-vision/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    Questions about Altman and Brockman's new startup
    Before founding OpenAI Sam Altman was the CEO of Y Combinator, a company that helps startups start and grow. There's probably no one on the planet who understands AI startups better than Sam. I'm expecting Sam and Greg to launch a new startup, perhaps before the new year. Here are some questions that come to mind about it. Who will give them the billions of dollars they will need to compete with OpenAI and Google? Sam is well known for moving fast. How much faster will he be able to move within a new startup, and what will he now be able to do that he couldn't do at OpenAI? Who will Sam and Greg tap to assume the role that Ilya Sutskever plays at OpenAI as chief researcher? I doubt that Sam and Greg are eager to launch another non-profit that could again vote them out of the company. How will they now go about their primary goal of democratizing AI within a for-profit company? How many more top people will leave OpenAI, and join Sam and Greg's new company? Altman already succeeded beyond anyone's wildest imagination by launching the AI revolution led by OpenAI. In a way it's a blessing that he no longer has to work on a project that has already been masterfully done. Do you think he has the next steps in this revolution already planned out in his head, and ready to go with his new company? submitted by /u/Georgeo57 [link] [comments]  ( 1 min )
    Greg Brockman Just Quit after They Fired Sam Altman
    submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    "AI vs AGI: Decoding the Future of Intelligent Technologies"
    submitted by /u/Fit-Code-5141 [link] [comments]  ( 1 min )
    What are some good IA to dub?
    We need a good IA to translate from Portuguese(Brazil) to English. We are exploring alternatives. So far we have seen https://labs.heygen.com/guest/video-translate that seems really good. But what are some alternative out there so we can try out and see which does the best translation for us? Thanks in advance. submitted by /u/MacASM [link] [comments]  ( 1 min )
    Sam Altman CEO & Co-Founder of OpenAI FIRED today by Board of Directors
    submitted by /u/PNWtreeguy69 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/17/2023
    Sam Altman fired as CEO of OpenAI.[1] At least he’s not replaced by GPT-5. The Pokémon Go developer is adding generative AI to Peridot in order to make its virtual pets feel and react more realistically.[2] DeepMind and YouTube release Lyria, a gen-AI model for music, and Dream Track to build AI tunes.[3] Google Delays Release of Gemini AI That Aims to Compete With OpenAI.[4] Sources: [1] https://www.theverge.com/2023/11/17/23965982/openai-ceo-sam-altman-fired [2] https://www.theverge.com/2023/11/15/23962241/niantic-generative-ai-peridot-pokemon-go [3] https://techcrunch.com/2023/11/16/deepmind-and-youtube-release-lyria-a-gen-ai-model-for-music-and-dream-track-to-build-ai-tunes/ [4] https://www.theinformation.com/articles/google-delays-cloud-release-of-gemini-ai-that-aims-to-compete-with-openai submitted by /u/Excellent-Target-847 [link] [comments]
    Why I don't think AGI can be achieved with GPT
    I cannot claim to know the exact architecture that OpenAI is using for ChatGPT, but if it is using a model resembling the decoder part of what is seen in the "Attention Is All You Need" paper then it cannot support AGI. Why would I think this? Generative text AI works by feeding in previous text to determine the next piece of text to use, then repeatingl doing that over and over, with the output added onto the input. It's a straight forward stateless function of inputs transforming into outputs. It has no working memory beyond the previous text that had been generated and it has no autonomy of any sort. ChatGPT is highly convincing, to the point where one might think a general intelligence has been achieved. This, though, is only the results of extensive training to output text very closely resembling what a human might write. This does not mean that ChatGPT has no "intelligence", but it is not a human-like intelligence. It has a deep "understanding" of language and how words and abstract concepts relate to each other, but only to the extent as to attempt to minimize generating unlikely responses. If it says that it has emotions, it is only because text it had trained on were written by people with emotions, and it is trained to mimic that. To a large language model, saying "I am happy" is no different than saying "the sky is blue" - it's just a statistically likely sequence of words. Maybe GPT can some day eventually lead into a future where AGI can be achieved, but it would require a different architecture to achieve it. I cannot claim to know what this architecture might look like, but I can speculate. Maybe it could be as simple as adding, along with prior text, hidden state to allow for unspoken thoughts and memories, and an event loop to allow for unprompted behavior. I don't know, but GPT, as it is, isn't capable of AGI. Currently, it is only a very large function of turning inputs into outputs. submitted by /u/SocksOnHands [link] [comments]
  • Open

    [R] in the context of a research, what methodology should i think of to find novel methods to fine tune LLMs for domain specific tasks?
    hey, i am currently working on a research project and i am wondering how can i define my methodology and what approach should i follow in order to make LLMs more capable than the generalist ones, precisely in the domain of education, to assist learners in specific contexts, such as courses, lessons, or concepts, by providing them with personalized hints and guidance to help them improve their skills without directly giving them answers? submitted by /u/Life_Ask2806 [link] [comments]  ( 9 min )
    "[Project]" 4-bit quantization
    Hey guys, I want to apply 4-bit GPTQ on the btml-3b model. I hesitate between using AutoGPTQ vs bitesandbytes for it, I made two separate code for it, could you tell me which one should I use, and if they would work: AutoGPQ: from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfigimport torchmodel_id = "C:\Users\William\Documents\STEM.AI\btml-3b"tokenizer = AutoTokenizer.from_pretrained(model_id)quantization_config = GPTQConfig(bits=4,group_size=128,dataset="wikitext2",desc_act=False,)tokenizer = AutoTokenizer.from_pretrained(model_id)quant_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, device_map='auto')quant_model.model.decoder.layers[0].self_attn.q_proj.__dict__save_directory = "C:\Users\William\Documents\STEM.AI\btml-3b_GPTQ"quant_model.save_pretrained(save_directory)tokenizer.save_pretrained(save_directory) bitesandbytes: from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfigimport torchmodel_id = "C:\\Users\\William\\Documents\\STEM.AI\\btml-3b"model = AutoModelForCausalLM.from_pretrained(model_id)tokenizer = AutoTokenizer.from_pretrained(model_id)nf4_config = BitsAndBytesConfig(load_in_4bit=True,bnb_4bit_quant_type="nf4",bnb_4bit_use_double_quant=True,bnb_4bit_compute_dtype=torch.bfloat16)quant_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)save_directory = "C:\\path_to_save_quantized_model"quant_model.save_pretrained(save_directory)tokenizer.save_pretrained(save_directory) submitted by /u/Will12123 [link] [comments]  ( 9 min )
    [D] why falcon-180b ranking has dramatically decreased?
    on the hugging face leaderboard, i was a bit surprised by the performance of falcon 180b. do you have any explanation of how? https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard https://preview.redd.it/ofzw8xr6h51c1.png?width=1535&format=png&auto=webp&s=4835a3fb20dc6e725d5b0f9001f3a4e605f49b6d submitted by /u/Life_Ask2806 [link] [comments]  ( 9 min )
    [R] cdsBERT - Extending Protein Language Models with Codon Awareness
    submitted by /u/ThrowawayTheReal [link] [comments]  ( 9 min )
    Programming and mechanical engineering [D]
    I am a mechanical engineering student I am very interested and fascinated by machine learning. I have chosen python as it is very user friendly. But I am having a hard time doing programming things like CAD and CAE are very hands on but learning programming is not at least for me. Any tips for that. Also every other course I find is just bull crap any good hands on course on YouTube for mechanical engineers learning machine learning. Please help. submitted by /u/AromaticEconomics113 [link] [comments]  ( 9 min )
    Google PaLM Error [D]
    Google PaLM Error Using LangChain and Google PaLM, in sequential chain concept getting following error, ChatGooglePalmError: ChatResponse must have atleast one candidate Please help! submitted by /u/gentle_casanova [link] [comments]  ( 9 min )
    [P] BentoML tutorial (with SageMaker deployment)
    Hey all, I created a video tutorial where I talk about the basics BentoML and how one can use it to deploy models on AWS SageMaker. ​ Video link: https://youtu.be/Zci_D4az9FU ​ Hope some of you will find it useful and I would be happy to answer any questions and get feedback:) ​ ​ submitted by /u/mildlyoverfitted [link] [comments]  ( 9 min )
    [P] I built SRec.ai, a Steam video game recommender (details on comment)
    submitted by /u/ilos-vigil [link] [comments]  ( 9 min )
    [D] How to find academic ML competitions
    There are websites like this keeping track of ML conference deadlines like this https://aideadlin.es/?sub=ML,CV,CG,NLP,RO,SP,DM,AP,KR Are there any such websites keeping track of ML competitions . Specifically competitions as part of ML conferences like Neurips, CVPR and so on. submitted by /u/Mammoth_Cod_9047 [link] [comments]  ( 9 min )
    [P] Predicting breakdowns in gearboxes
    In my master thesis I want to predict when a gearbox is going to brake in an offshore wind turbine. I have SCADA data for 10 years of operation and I have one gearbox breakdown occuring I this period. Due to the sarce number of breakdowns I am doing a normal behavior model e.g. to train the model on healthy data and then to test it on the data with the breakdown to hopefully see an increase in residuals. Currently I am able to predict the output variable (gearbox oil temperature) with and R2 of 81% (with ANN and RF models) however I do not see and increase in residuals when testing. Does anyone have any advice? submitted by /u/No_Seaworthiness5468 [link] [comments]  ( 9 min )
    ML and visualisations [D]
    Hey everyone, what kind of cool ML projects have you done? Have any of them involved visualisations in a meaningful and interesting way? What packages/libraries did you use? I’d love to hear about some cool projects! submitted by /u/Brilliant-Donkey-320 [link] [comments]  ( 9 min )
    [P] I'm Making A New Translator AI, I Need Some Feedback From Native Speakers.
    submitted by /u/Mdsoebee [link] [comments]  ( 9 min )
    [P] GPT-J-6B Model Help
    Hello, I am trying to load this model but I can't seem to get anything back and I am not sure if its because of my hardware or if I am doing something wrong with the code. I currently own a server HPE DL380p Gen8. It has 2 Xeon E5-2650 @ 2.00GHz Processors, 400GB of RAM, I even have a GPU but I am not using it for the model. Whenever I try to provide an input to the model it has never sent back a output. It just seems to be maxing out my CPU's on my server and even after waiting for 20+ minutes nothing. I allocated 32 Virtual Cores to it as well as 200GB of RAM. In total its using about 60% of my total CPU on average while the rest of the server is using up the rest. I am using the full weight model and I have the model generating at a max length of 1000. From what I read online the RAM should be plenty but I dont know if my CPU is sufficient. How long should it take generally to generate an output? I am doing very simple inputs like "Hello who are you". Any help would be much appreciated submitted by /u/mdcouron [link] [comments]  ( 9 min )
    [D] Deployment of Image Classification ML Model through JSP
    Hey fellow Redditors! 👋 I've been working on an Image Classification ML model, and now I'm looking to deploy it using JSP (JavaServer Pages) with my front-end developer friend. I've got the model trained and ready to go, but we need clarification about how to integrate the model into JSP. Here are a few things I'm wondering about: Integration with JSP: How can I integrate my ML model with JSP? Are there any specific libraries, ways, or tools that work well? Handling Image Uploads: My application allows users to upload images for classification. How can I handle image uploads in a JSP environment and pass them to the ML model? I have done some research about Flask, but my friend prefers using JSP. I'm eager to hear from the community and learn from your experiences. Any advice, code snippets, or resources would be greatly appreciated! Thank you! submitted by /u/Chuia [link] [comments]  ( 9 min )
    [D] Is it still possible to obtain a copy of the SUNCG dataset?
    I understand the dataset is no longer available, but copies of it must still exist somewhere, I hope. submitted by /u/rfurlan [link] [comments]  ( 9 min )
    [D] What to expect from a Major in Image Processing & Analysis?
    Here's the details of the Major - Image Processing & Analysis (IPA) Major – Core Modules: Data Analysis and Machine Learning Real-Time Digital Signal Processing Entrepreneurship for Engineer Computer Vision Masters Project – IPA Major Complementary modules: OOP with Embedded Systems Image Processing & Analysis Web Application Development 3D Interface Technologies Core Skills: Data analytics, machine learning, deep learning, signal processing, industry context“. Master the areas of image processing and analysis, computer and machine vision, biomedical imaging and image synthesis” This is all the information for a MEng in Electronic & Computer Engineering I’m planning to study this January 2024 for 18 months, I have a BSc in Computer Science with Software Development. Do you know what this MEng if I majored in the above could entail? I am somewhat interested in Computer Vision but mainly in the field of Robotics (spatial awareness, path planning etc). Would this be a good major to pursue a career in Robotics? I've never really tackled anything CV before. Also, side question, this may be a bit farfetched but is CV in any way applicable to game development? I have a side gig developing 2D games and I was thinking possibly it could help with behaviour recognition, pathfinding algorithms, adaptive AI or even gesture recognition. Is this thinking too ahead and more in the realm of AI? Is my understanding of CV being mainly focused with extracting information from images and videos but still being intertwined with AI /ML wrong? I am hoping to gain some AI/ML skills from the major too. Thank you! submitted by /u/AOTSuperiority [link] [comments]  ( 9 min )
    [R] Meta Unveils Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
    When generating videos from text prompts, directly mapping language to high-res video tends to produce inconsistent, blurry results. The high dimensionality overwhelms models. Researchers at Meta took a different approach - first generate a high-quality image from the text, then generate a video conditioned on both image and text. The image acts like a "starting point" that the model can imagine moving over time based on the text prompt. This stronger conditioning signal produces way better videos. They built a model called Emu Video using diffusion models. It sets a new SOTA for text-to-video generation: "In human evaluations, our generated videos are strongly preferred in quality compared to all prior work– 81% vs. Google’s Imagen Video, 90% vs. Nvidia’s PYOCO, and 96% vs. Meta’s Make-A-Video." "Our factorizing approach naturally lends itself to animating images based on a user’s text prompt, where our generations are preferred 96% over prior work." The key was "factorizing" into image and then video generation. Being able to condition on both text AND a generated image makes the video task much easier. The model just has to imagine how to move the image, instead of hallucinating everything. They can also animate user-uploaded images by providing the image as conditioning. Again, reported to be way better than previous techniques. It's cool to see research pushing text-to-video generation forward. Emu Video shows how stronger conditioning through images sets a new quality bar. This is a nice compliment to the Emu Edit model they released as well. TLDR: By first generating an image conditioned on text, then generating video conditioned on both image and text, you can get better video generation. Full summary is here. Paper site is here. submitted by /u/Successful-Western27 [link] [comments]  ( 10 min )
  • Open

    How to fine-tune LLMs using LoRA - Explained
    Hi there, I've created a video here where I explain how you can fine-tune large language models using low-rank adaptation (LoRA). I hope it may be of use to some of you out there. Feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 1 min )
    GPU Survival Toolkit for the AI age: The bare minimum every developer must know
    submitted by /u/nickb [link] [comments]  ( 1 min )
    Classification ANN Loss going Up and Down
    This is my dataset on which i want to train my Classification ANN on Color(Red=1,Green=0)=x1 Weight(1 to 10)=x2 Sweetness(1 to 10)=x3 Fruit=y,first element=apple,2nd element=watermelon 1 1.2 6 [1,0] 1 8 9 [0,1] 2 1.5 5 [1,0] 2 5 8 [0,1] 1 2 7 [1,0] 2 1.3 5 [1,0] 2 7 10 [0,1] So i coded an ANN with the following structure 3 Nodes in Input Layer2 Hidden Layers with 3 Neurons each2 Nodes in Output Layer Activation function in hidden layer=Sigmoid Output function=Softmax each Hidden Layer neuron has 2 parts: Pre-activation(a) and Activation(h)a(t)=h(t-1)w(t)+b(t)h(t)=sigmoid(a(t)) This is the code I wrote def sigmoid(x): return 1/(1+np.exp(-x)) ​ def activation(val,name=""): if name=="": return val elif name=="sigmoid": return sigmoid(val) ​ def deriv…  ( 1 min )
  • Open

    Schwarz lemma, Schwarz-Pick theorem, and Poincare metric
    Let D be the open unit disk in the complex plane. The Schwarz lemma says that if f is an analytic function from D to D with f(0) = 0, then for all z in D. (The lemma also says more, but this post will focus on just this portion of the theorem. The Schwarz-Pick theorem […] Schwarz lemma, Schwarz-Pick theorem, and Poincare metric first appeared on John D. Cook.  ( 5 min )
    Factored random numbers
    A couple days ago Michael Nielsen posted an image of a one-page paper that gives an algorithm for generating factored random numbers, uniformly distributed from 1 to some designated N. The algorithm does not generate random numbers then factor them. It’s more efficient than that, generating the factorization along with the final result. It does […] Factored random numbers first appeared on John D. Cook.  ( 5 min )
    DICOM image data
    The previous post discussed EXIF data embedded in a digital photo. DICOM files are analogous medical images. You can think of a DICOM image as a JPEG with medical metadata. Strictly speaking a DICOM file is a sort of database, and one of the fields in the database contains the pixels. The pixels are […] DICOM image data first appeared on John D. Cook.  ( 6 min )
  • Open

    "JaxMARL: Multi-Agent RL Environments in JAX", Rutherford et al 2023 (envs: MPE/Overcooked/Brax/STORM/Hanabi/Switch/Coin/SMAC; agents: UPPO/QMIX/VDN/IQL/MAPPO?)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Stable-Baselines3 v2.2 is out!
    We added support for options on reset, fixed several bugs and improved error messages. We also updated our RL Tips and Tricks to include recommendations for evaluation, and added links to newer algorithms like DroQ. SBX (SB3 + Jax) got two new algorithms: DDPG and TD3! Changelog: https://github.com/DLR-RM/stable-baselines3/releases/tag/v2.2.1 SB3 Contrib (more algorithms): https://github.com/Stable-Baselines-Team/stable-baselines3-contrib RL Zoo3 (training framework): https://github.com/DLR-RM/rl-baselines3-zoo Stable-Baselines Jax (SBX): https://github.com/araffin/sbx Note: v2.2.0 was yanked after a breaking change was found in GH#1751. Please use SB3 v2.2.1 and not v2.2.0. submitted by /u/araffin2 [link] [comments]  ( 1 min )
    RL Help
    So I'm new to the RL field, and I'm very interested in the topic. I'm in college and I'm working in a research lab where we're trying to get a stablebaseline3 RL agent to play the game billiards in a physics environment we built. I'm obviously very new to this, so I don't really know what going on under the hood at al, I've done several DL projects for my classes, but this is a different beast. My reward functions seem to be accurate, but the primary issue is with the number of timesteps for which the agent trains. I have found that the agent performs extremely well at getting the ball in the hole (the environment just has 1 ball for now) at exactly 4 million timesteps, and only 4 million timesteps. I don't know if that number is a lot or a little for this problem and for RL, it takes about an hour of training per million. But if I change the number of timesteps at all, the whole progress graph looks completely different, and follows the same pattern for every value aside from 4 million. I would expect the graph to look generally the same up until the 4 million mark, then perhaps go wonky, but the second I change the timesteps to even 4.1 or 3.9 million, everything acts differently and fails completely. Hopefully this made sense and I would love to hear anyone's thoughts on this. ​ This first image is the success rate of even hitting the ball at exactly 4 million timesteps This second image shows the success rate of hitting the ball up to 8 million timesteps. (Keep in mind that when I set any value aside from 4 million, I get this same graph up to whatever value I entered) submitted by /u/pmmason127 [link] [comments]  ( 1 min )
    Stable_Baseline3 render is being Glitchy.
    I recently installed Ubuntu 20.04 on my Windows 10 system. I created a virtual environment using Python 3.8.10 and installed SB3 by its guidelines "pip install stable-baselines3[extra]" When I ran the example of how to train and run A2C on a CartPole environment on the SB3 documentation the training went fine but the rendering is glitchy and framey as shows with the photo. Any help and advice will be greatly appreciated. ​ https://preview.redd.it/ijdkaqljm11c1.png?width=668&format=png&auto=webp&s=947ad5c2d575bbcd828397dcf31f73ea82f594f4 submitted by /u/AdrianR956 [link] [comments]  ( 1 min )
  • Open

    Schmarzo and the Value·Nauts: The Journey from Data to Value
    The journey from data and AI to value does not start with data or AI. Data and analytics are the new sources of value creation and competitive advantage in today’s economy.  Some organizations have “cracked the” code in applying AI/ML to their data to uncover the customer, product, service, and operational predictive propensities (propensity to… Read More »Schmarzo and the Value·Nauts: The Journey from Data to Value The post Schmarzo and the Value·Nauts: The Journey from Data to Value appeared first on Data Science Central.  ( 23 min )
  • Open

    Semi-automatic staging area for high-quality structured data extraction from scientific literature. (arXiv:2309.10923v2 [cs.CL] UPDATED)
    We propose a semi-automatic staging area for efficiently building an accurate database of experimental physical properties of superconductors from literature, called SuperCon2, to enrich the existing manually-built superconductor database SuperCon. Here we report our curation interface (SuperCon2 Interface) and a workflow managing the state transitions of each examined record, to validate the dataset of superconductors from PDF documents collected using Grobid-superconductors in a previous work. This curation workflow allows both automatic and manual operations, the former contains ``anomaly detection'' that scans new data identifying outliers, and a ``training data collector'' mechanism that collects training data examples based on manual corrections. Such training data collection policy is effective in improving the machine-learning models with a reduced number of examples. For manual operations, the interface (SuperCon2 interface) is developed to increase efficiency during manual correction by providing a smart interface and an enhanced PDF document viewer. We show that our interface significantly improves the curation quality by boosting precision and recall as compared with the traditional ``manual correction''. Our semi-automatic approach would provide a solution for achieving a reliable database with text-data mining of scientific documents.  ( 2 min )
    On consequences of finetuning on data with highly discriminative features. (arXiv:2310.19537v2 [cs.LG] UPDATED)
    In the era of transfer learning, training neural networks from scratch is becoming obsolete. Transfer learning leverages prior knowledge for new tasks, conserving computational resources. While its advantages are well-documented, we uncover a notable drawback: networks tend to prioritize basic data patterns, forsaking valuable pre-learned features. We term this behavior "feature erosion" and analyze its impact on network performance and internal representations.  ( 2 min )
    Universality and Limitations of Prompt Tuning. (arXiv:2305.18787v2 [cs.LG] UPDATED)
    Despite the demonstrated empirical efficacy of prompt tuning to adapt a pretrained language model for a new task, the theoretical underpinnings of the difference between "tuning parameters before the input" against "the tuning of model weights" are limited. We thus take one of the first steps to understand the role of soft-prompt tuning for transformer-based architectures. By considering a general purpose architecture, we analyze prompt tuning from the lens of both: universal approximation and limitations with finite-depth fixed-weight pretrained transformers for continuous-valued functions. Our universality result guarantees the existence of a strong transformer with a prompt to approximate any sequence-to-sequence function in the set of Lipschitz functions. The limitations of prompt tuning for limited-depth transformers are first proved by constructing a set of datasets, that cannot be memorized by a prompt of any length for a given single encoder layer. We also provide a lower bound on the required number of tunable prompt parameters and compare the result with the number of parameters required for a low-rank update (based on LoRA) for a single-layer setting. We finally extend our analysis to multi-layer settings by providing sufficient conditions under which the transformer can at best learn datasets from invertible functions only. Our theoretical claims are also corroborated by empirical results.  ( 2 min )
    On some elusive aspects of databases hindering AI based discovery: A case study on superconducting materials. (arXiv:2311.09891v1 [cond-mat.other])
    It stands to reason that the amount and the quality of big data is of key importance for setting up accurate AI-driven models. Nonetheless, we believe there are still critical roadblocks in the inherent generation of databases, that are often underestimated and poorly discussed in the literature. In our view, such issues can seriously hinder the AI-based discovery process, even when high quality, sufficiently large and highly reputable data sources are available. Here, considering superconducting and thermoelectric materials as two representative case studies, we specifically discuss three aspects, namely intrinsically biased sample selection, possible hidden variables, disparate data age. Importantly, to our knowledge, we suggest and test a first strategy capable of detecting and quantifying the presence of the intrinsic data bias.  ( 2 min )
    Spectral2Spectral: Image-spectral Similarity Assisted Spectral CT Deep Reconstruction without Reference. (arXiv:2210.01125v3 [eess.IV] UPDATED)
    Spectral computed tomography based on a photon-counting detector (PCD) attracts more and more attentions since it has the capability to provide more accurate identification and quantitative analysis for biomedical materials. The limited number of photons within narrow energy bins leads to imaging results of low signal-noise ratio. The existing supervised deep reconstruction networks for CT reconstruction are difficult to address these challenges because it is usually impossible to acquire noise-free clinical images with clear structures as references. In this paper, we propose an iterative deep reconstruction network to synergize unsupervised method and data priors into a unified framework, named as Spectral2Spectral. Our Spectral2Spectral employs an unsupervised deep training strategy to obtain high-quality images from noisy data in an end-to-end fashion. The structural similarity prior within image-spectral domain is refined as a regularization term to further constrain the network training. The weights of neural network are automatically updated to capture image features and structures within the iterative process. Three large-scale preclinical datasets experiments demonstrate that the Spectral2spectral reconstructs better image quality than other the state-of-the-art methods.
    Learning Optimal Contracts: How to Exploit Small Action Spaces. (arXiv:2309.09801v2 [cs.GT] UPDATED)
    We study principal-agent problems in which a principal commits to an outcome-dependent payment scheme -- called contract -- in order to induce an agent to take a costly, unobservable action leading to favorable outcomes. We consider a generalization of the classical (single-round) version of the problem in which the principal interacts with the agent by committing to contracts over multiple rounds. The principal has no information about the agent, and they have to learn an optimal contract by only observing the outcome realized at each round. We focus on settings in which the size of the agent's action space is small. We design an algorithm that learns an approximately-optimal contract with high probability in a number of rounds polynomial in the size of the outcome space, when the number of actions is constant. Our algorithm solves an open problem by Zhu et al.[2022]. Moreover, it can also be employed to provide a $\tilde{\mathcal{O}}(T^{4/5})$ regret bound in the related online learning setting in which the principal aims at maximizing their cumulative utility, thus considerably improving previously-known regret bounds.
    An Overview Of Temporal Commonsense Reasoning and Acquisition. (arXiv:2308.00002v3 [cs.AI] UPDATED)
    Temporal commonsense reasoning refers to the ability to understand the typical temporal context of phrases, actions, and events, and use it to reason over problems requiring such knowledge. This trait is essential in temporal natural language processing tasks, with possible applications such as timeline summarization, temporal question answering, and temporal natural language inference. Recent research on the performance of large language models suggests that, although they are adept at generating syntactically correct sentences and solving classification tasks, they often take shortcuts in their reasoning and fall prey to simple linguistic traps. This article provides an overview of research in the domain of temporal commonsense reasoning, particularly focusing on enhancing language model performance through a variety of augmentations and their evaluation across a growing number of datasets. However, these augmented models still struggle to approach human performance on reasoning tasks over temporal common sense properties, such as the typical occurrence times, orderings, or durations of events. We further emphasize the need for careful interpretation of research to guard against overpromising evaluation results in light of the shallow reasoning present in transformers. This can be achieved by appropriately preparing datasets and suitable evaluation metrics.
    Information-Theoretic Bounds on The Removal of Attribute-Specific Bias From Neural Networks. (arXiv:2310.04955v2 [cs.LG] UPDATED)
    Ensuring a neural network is not relying on protected attributes (e.g., race, sex, age) for predictions is crucial in advancing fair and trustworthy AI. While several promising methods for removing attribute bias in neural networks have been proposed, their limitations remain under-explored. In this work, we mathematically and empirically reveal an important limitation of attribute bias removal methods in presence of strong bias. Specifically, we derive a general non-vacuous information-theoretical upper bound on the performance of any attribute bias removal method in terms of the bias strength. We provide extensive experiments on synthetic, image, and census datasets to verify the theoretical bound and its consequences in practice. Our findings show that existing attribute bias removal methods are effective only when the inherent bias in the dataset is relatively weak, thus cautioning against the use of these methods in smaller datasets where strong attribute bias can occur, and advocating the need for methods that can overcome this limitation.
    Latent State Models of Training Dynamics. (arXiv:2308.09543v2 [cs.LG] UPDATED)
    The impact of randomness on model training is poorly understood. How do differences in data order and initialization actually manifest in the model, such that some training runs outperform others or converge faster? Furthermore, how can we interpret the resulting training dynamics and the phase transitions that characterize different trajectories? To understand the effect of randomness on the dynamics and outcomes of neural network training, we train models multiple times with different random seeds and compute a variety of metrics throughout training, such as the $L_2$ norm, mean, and variance of the neural network's weights. We then fit a hidden Markov model (HMM) over the resulting sequences of metrics. The HMM represents training as a stochastic process of transitions between latent states, providing an intuitive overview of significant changes during training. Using our method, we produce a low-dimensional, discrete representation of training dynamics on grokking tasks, image classification, and masked language modeling. We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
    FedScore: A privacy-preserving framework for federated scoring system development. (arXiv:2303.00282v1 [cs.LG] CROSS LISTED)
    We propose FedScore, a privacy-preserving federated learning framework for scoring system generation across multiple sites to facilitate cross-institutional collaborations. The FedScore framework includes five modules: federated variable ranking, federated variable transformation, federated score derivation, federated model selection and federated model evaluation. To illustrate usage and assess FedScore's performance, we built a hypothetical global scoring system for mortality prediction within 30 days after a visit to an emergency department using 10 simulated sites divided from a tertiary hospital in Singapore. We employed a pre-existing score generator to construct 10 local scoring systems independently at each site and we also developed a scoring system using centralized data for comparison. We compared the acquired FedScore model's performance with that of other scoring models using the receiver operating characteristic (ROC) analysis. The FedScore model achieved an average area under the curve (AUC) value of 0.763 across all sites, with a standard deviation (SD) of 0.020. We also calculated the average AUC values and SDs for each local model, and the FedScore model showed promising accuracy and stability with a high average AUC value which was closest to the one of the pooled model and SD which was lower than that of most local models. This study demonstrates that FedScore is a privacy-preserving scoring system generator with potentially good generalizability.
    MoCo-Transfer: Investigating out-of-distribution contrastive learning for limited-data domains. (arXiv:2311.09401v1 [cs.CV])
    Medical imaging data is often siloed within hospitals, limiting the amount of data available for specialized model development. With limited in-domain data, one might hope to leverage larger datasets from related domains. In this paper, we analyze the benefit of transferring self-supervised contrastive representations from moment contrast (MoCo) pretraining on out-of-distribution data to settings with limited data. We consider two X-ray datasets which image different parts of the body, and compare transferring from each other to transferring from ImageNet. We find that depending on quantity of labeled and unlabeled data, contrastive pretraining on larger out-of-distribution datasets can perform nearly as well or better than MoCo pretraining in-domain, and pretraining on related domains leads to higher performance than if one were to use the ImageNet pretrained weights. Finally, we provide a preliminary way of quantifying similarity between datasets.
    Group-Aware Interest Disentangled Dual-Training for Personalized Recommendation. (arXiv:2311.09577v1 [cs.IR])
    Personalized recommender systems aim to predict users' preferences for items. It has become an indispensable part of online services. Online social platforms enable users to form groups based on their common interests. The users' group participation on social platforms reveals their interests and can be utilized as side information to mitigate the data sparsity and cold-start problem in recommender systems. Users join different groups out of different interests. In this paper, we generate group representation from the user's interests and propose IGRec (Interest-based Group enhanced Recommendation) to utilize the group information accurately. It consists of four modules. (1) Interest disentangler via self-gating that disentangles users' interests from their initial embedding representation. (2) Interest aggregator that generates the interest-based group representation by Gumbel-Softmax aggregation on the group members' interests. (3) Interest-based group aggregation that fuses user's representation with the participated group representation. (4) A dual-trained rating prediction module to utilize both user-item and group-item interactions. We conduct extensive experiments on three publicly available datasets. Results show IGRec can effectively alleviate the data sparsity problem and enhance the recommender system with interest-based group representation. Experiments on the group recommendation task further show the informativeness of interest-based group representation.
    CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs. (arXiv:2311.09690v1 [cs.LG])
    Deep Neural Networks (DNNs) have shown excellent performance in a wide range of machine learning applications. Knowing the latency of running a DNN model or tensor program on a specific device is useful in various tasks, such as DNN graph- or tensor-level optimization and device selection. Considering the large space of DNN models and devices that impede direct profiling of all combinations, recent efforts focus on building a predictor to model the performance of DNN models on different devices. However, none of the existing attempts have achieved a cost model that can accurately predict the performance of various tensor programs while supporting both training and inference accelerators. We propose CDMPP, an efficient tensor program latency prediction framework for both cross-model and cross-device prediction. We design an informative but efficient representation of tensor programs, called compact ASTs, and a pre-order-based positional encoding method, to capture the internal structure of tensor programs. We develop a domain-adaption-inspired method to learn domain-invariant representations and devise a KMeans-based sampling algorithm, for the predictor to learn from different domains (i.e., different DNN operators and devices). Our extensive experiments on a diverse range of DNN models and devices demonstrate that CDMPP significantly outperforms state-of-the-art baselines with 14.03% and 10.85% prediction error for cross-model and cross-device prediction, respectively, and one order of magnitude higher training efficiency. The implementation and the expanded dataset are available at https://github.com/joapolarbear/cdmpp.
    Adversarially Robust Spiking Neural Networks Through Conversion. (arXiv:2311.09266v1 [cs.NE])
    Spiking neural networks (SNNs) provide an energy-efficient alternative to a variety of artificial neural network (ANN) based AI applications. As the progress in neuromorphic computing with SNNs expands their use in applications, the problem of adversarial robustness of SNNs becomes more pronounced. To the contrary of the widely explored end-to-end adversarial training based solutions, we address the limited progress in scalable robust SNN training methods by proposing an adversarially robust ANN-to-SNN conversion algorithm. Our method provides an efficient approach to embrace various computationally demanding robust learning objectives that have been proposed for ANNs. During a post-conversion robust finetuning phase, our method adversarially optimizes both layer-wise firing thresholds and synaptic connectivity weights of the SNN to maintain transferred robustness gains from the pre-trained ANN. We perform experimental evaluations in numerous adaptive adversarial settings that account for the spike-based operation dynamics of SNNs, and show that our approach yields a scalable state-of-the-art solution for adversarially robust deep SNNs with low-latency.
    Towards Autonomous Hypothesis Verification via Language Models with Minimal Guidance. (arXiv:2311.09706v1 [cs.AI])
    Research automation efforts usually employ AI as a tool to automate specific tasks within the research process. To create an AI that truly conduct research themselves, it must independently generate hypotheses, design verification plans, and execute verification. Therefore, we investigated if an AI itself could autonomously generate and verify hypothesis for a toy machine learning research problem. We prompted GPT-4 to generate hypotheses and Python code for hypothesis verification with limited methodological guidance. Our findings suggest that, in some instances, GPT-4 can autonomously generate and validate hypotheses without detailed guidance. While this is a promising result, we also found that none of the verifications were flawless, and there remain significant challenges in achieving autonomous, human-level research using only generic instructions. These findings underscore the need for continued exploration to develop a general and autonomous AI researcher.
    Improved Bayes Risk Can Yield Reduced Social Welfare Under Competition. (arXiv:2306.14670v2 [cs.GT] UPDATED)
    As the scale of machine learning models increases, trends such as scaling laws anticipate consistent downstream improvements in predictive accuracy. However, these trends take the perspective of a single model-provider in isolation, while in reality providers often compete with each other for users. In this work, we demonstrate that competition can fundamentally alter the behavior of these scaling trends, even causing overall predictive accuracy across users to be non-monotonic or decreasing with scale. We define a model of competition for classification tasks, and use data representations as a lens for studying the impact of increases in scale. We find many settings where improving data representation quality (as measured by Bayes risk) decreases the overall predictive accuracy across users (i.e., social welfare) for a marketplace of competing model-providers. Our examples range from closed-form formulas in simple settings to simulations with pretrained representations on CIFAR-10. At a conceptual level, our work suggests that favorable scaling trends for individual model-providers need not translate to downstream improvements in social welfare in marketplaces with multiple model providers.
    Data Representations' Study of Latent Image Manifolds. (arXiv:2305.19730v2 [cs.LG] UPDATED)
    Deep neural networks have been demonstrated to achieve phenomenal success in many domains, and yet their inner mechanisms are not well understood. In this paper, we investigate the curvature of image manifolds, i.e., the manifold deviation from being flat in its principal directions. We find that state-of-the-art trained convolutional neural networks for image classification have a characteristic curvature profile along layers: an initial steep increase, followed by a long phase of a plateau, and followed by another increase. In contrast, this behavior does not appear in untrained networks in which the curvature flattens. We also show that the curvature gap between the last two layers has a strong correlation with the generalization capability of the network. Moreover, we find that the intrinsic dimension of latent codes is not necessarily indicative of curvature. Finally, we observe that common regularization methods such as mixup yield flatter representations when compared to other methods. Our experiments show consistent results over a variety of deep learning architectures and multiple data sets. Our code is publicly available at https://github.com/azencot-group/CRLM
    BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion. (arXiv:2305.15798v3 [cs.LG] UPDATED)
    Text-to-image (T2I) generation with Stable Diffusion models (SDMs) involves high computing demands due to billion-scale parameters. To enhance efficiency, recent studies have reduced sampling steps and applied network quantization while retaining the original architectures. The lack of architectural reduction attempts may stem from worries over expensive retraining for such massive models. In this work, we uncover the surprising potential of block pruning and feature distillation for low-cost general-purpose T2I. By removing several residual and attention blocks from the U-Net of SDMs, we achieve 30%~50% reduction in model size, MACs, and latency. We show that distillation retraining is effective even under limited resources: using only 13 A100 days and a tiny dataset, our compact models can imitate the original SDMs (v1.4 and v2.1-base with over 6,000 A100 days). Benefiting from the transferred knowledge, our BK-SDMs deliver competitive results on zero-shot MS-COCO against larger multi-billion parameter models. We further demonstrate the applicability of our lightweight backbones in personalized generation and image-to-image translation. Deployment of our models on edge devices attains 4-second inference. We hope this work can help build small yet powerful diffusion models with feasible training budgets. Code and models can be found at: https://github.com/Nota-NetsPresso/BK-SDM
    Leveraging LLMs in Scholarly Knowledge Graph Question Answering. (arXiv:2311.09841v1 [cs.CL])
    This paper presents a scholarly Knowledge Graph Question Answering (KGQA) that answers bibliographic natural language questions by leveraging a large language model (LLM) in a few-shot manner. The model initially identifies the top-n similar training questions related to a given test question via a BERT-based sentence encoder and retrieves their corresponding SPARQL. Using the top-n similar question-SPARQL pairs as an example and the test question creates a prompt. Then pass the prompt to the LLM and generate a SPARQL. Finally, runs the SPARQL against the underlying KG - ORKG (Open Research KG) endpoint and returns an answer. Our system achieves an F1 score of 99.0%, on SciQA - one of the Scholarly-QALD-23 challenge benchmarks.
    Short vs. Long-term Coordination of Drones: When Distributed Optimization Meets Deep Reinforcement Learning. (arXiv:2311.09852v1 [cs.RO])
    Swarms of smart drones, with the support of charging technology, can provide completing sensing capabilities in Smart Cities, such as traffic monitoring and disaster response. Existing approaches, including distributed optimization and deep reinforcement learning (DRL), aim to coordinate drones to achieve cost-effective, high-quality navigation, sensing, and recharging. However, they have distinct challenges: short-term optimization struggles to provide sustained benefits, while long-term DRL lacks scalability, resilience, and flexibility. To bridge this gap, this paper introduces a new progressive approach that encompasses the planning and selection based on distributed optimization, as well as DRL-based flying direction scheduling. Extensive experiment with datasets generated from realisitic urban mobility demonstrate the outstanding performance of the proposed solution in traffic monitoring compared to three baseline methods.
    Redefining Super-Resolution: Fine-mesh PDE predictions without classical simulations. (arXiv:2311.09740v1 [physics.flu-dyn])
    In Computational Fluid Dynamics (CFD), coarse mesh simulations offer computational efficiency but often lack precision. Applying conventional super-resolution to these simulations poses a significant challenge due to the fundamental contrast between downsampling high-resolution images and authentically emulating low-resolution physics. The former method conserves more of the underlying physics, surpassing the usual constraints of real-world scenarios. We propose a novel definition of super-resolution tailored for PDE-based problems. Instead of simply downsampling from a high-resolution dataset, we use coarse-grid simulated data as our input and predict fine-grid simulated outcomes. Employing a physics-infused UNet upscaling method, we demonstrate its efficacy across various 2D-CFD problems such as discontinuity detection in Burger's equation, Methane combustion, and fouling in Industrial heat exchangers. Our method enables the generation of fine-mesh solutions bypassing traditional simulation, ensuring considerable computational saving and fidelity to the original ground truth outcomes. Through diverse boundary conditions during training, we further establish the robustness of our method, paving the way for its broad applications in engineering and scientific CFD solvers.
    Runtime Verification of Learning Properties for Reinforcement Learning Algorithms. (arXiv:2311.09811v1 [cs.LG])
    Reinforcement learning (RL) algorithms interact with their environment in a trial-and-error fashion. Such interactions can be expensive, inefficient, and timely when learning on a physical system rather than in a simulation. This work develops new runtime verification techniques to predict when the learning phase has not met or will not meet qualitative and timely expectations. This paper presents three verification properties concerning the quality and timeliness of learning in RL algorithms. With each property, we propose design steps for monitoring and assessing the properties during the system's operation.
    MUDGUARD: Taming Malicious Majorities in Federated Learning using Privacy-Preserving Byzantine-Robust Clustering. (arXiv:2208.10161v2 [cs.CR] UPDATED)
    Byzantine-robust Federated Learning (FL) aims to counter malicious clients and train an accurate global model while maintaining an extremely low attack success rate. Most existing systems, however, are only robust when most of the clients are honest. FLTrust (NDSS '21) and Zeno++ (ICML '20) do not make such an honest majority assumption but can only be applied to scenarios where the server is provided with an auxiliary dataset used to filter malicious updates. FLAME (USENIX '22) and EIFFeL (CCS '22) maintain the semi-honest majority assumption to guarantee robustness and the confidentiality of updates. It is therefore currently impossible to ensure Byzantine robustness and confidentiality of updates without assuming a semi-honest majority. To tackle this problem, we propose a novel Byzantine-robust and privacy-preserving FL system, called MUDGUARD, that can operate under malicious minority \emph{or majority} in both the server and client sides. Based on DBSCAN, we design a new method for extracting features from model updates via pairwise adjusted cosine similarity to boost the accuracy of the resulting clustering. To thwart attacks from a malicious majority, we develop a method called \textit{Model Segmentation}, that aggregates together only the updates from within a cluster, sending the corresponding model only to the clients of the corresponding cluster. The fundamental idea is that even if malicious clients are in their majority, their poisoned updates cannot harm benign clients if they are confined only within the malicious cluster. We also leverage multiple cryptographic tools to conduct clustering without sacrificing training correctness and updates confidentiality. We present a detailed security proof and empirical evaluation along with a convergence analysis for MUDGUARD.
    Safety Aware Autonomous Path Planning Using Model Predictive Reinforcement Learning for Inland Waterways. (arXiv:2311.09878v1 [cs.LG])
    In recent years, interest in autonomous shipping in urban waterways has increased significantly due to the trend of keeping cars and trucks out of city centers. Classical approaches such as Frenet frame based planning and potential field navigation often require tuning of many configuration parameters and sometimes even require a different configuration depending on the situation. In this paper, we propose a novel path planning approach based on reinforcement learning called Model Predictive Reinforcement Learning (MPRL). MPRL calculates a series of waypoints for the vessel to follow. The environment is represented as an occupancy grid map, allowing us to deal with any shape of waterway and any number and shape of obstacles. We demonstrate our approach on two scenarios and compare the resulting path with path planning using a Frenet frame and path planning based on a proximal policy optimization (PPO) agent. Our results show that MPRL outperforms both baselines in both test scenarios. The PPO based approach was not able to reach the goal in either scenario while the Frenet frame approach failed in the scenario consisting of a corner with obstacles. MPRL was able to safely (collision free) navigate to the goal in both of the test scenarios.
    Probabilities of the third type: Statistical Relational Learning and Reasoning with Relative Frequencies. (arXiv:2202.10367v2 [cs.AI] UPDATED)
    Dependencies on the relative frequency of a state in the domain are common when modelling probabilistic dependencies on relational data. For instance, the likelihood of a school closure during an epidemic might depend on the proportion of infected pupils exceeding a threshold. Often, rather than depending on discrete thresholds, dependencies are continuous: for instance, the likelihood of any one mosquito bite transmitting an illness depends on the proportion of carrier mosquitoes. Current approaches usually only consider probabilities over possible worlds rather than over domain elements themselves. An exception are the recently introduced Lifted Bayesian Networks for Conditional Probability Logic, which express discrete dependencies on probabilistic data. We introduce functional lifted Bayesian networks, a formalism that explicitly incorporates continuous dependencies on relative frequencies into statistical relational artificial intelligence. and compare and contrast them with ifted Bayesian Networks for Conditional Probability Logic. Incorporating relative frequencies is not only beneficial to modelling; it also provides a more rigorous approach to learning problems where training and test or application domains have different sizes. To this end, we provide a representation of the asymptotic probability distributions induced by functional lifted Bayesian networks on domains of increasing sizes. Since that representation has well-understood scaling behaviour across domain sizes, it can be used to estimate parameters for a large domain consistently from randomly sampled subpopulations. Furthermore, we show that in parametric families of FLBN, convergence is uniform in the parameters, which ensures a meaningful dependence of the asymptotic probabilities on the parameters of the model.
    Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse. (arXiv:2311.07468v2 [cs.CL] UPDATED)
    Recent studies have highlighted a phenomenon in large language models (LLMs) known as "the reversal curse," in which the order of knowledge entities in the training data biases the models' comprehension. For example, if a model is trained on sentences where entity A consistently appears before entity B, it can respond to queries about A by providing B as the answer. However, it may encounter confusion when presented with questions concerning B. We contend that the reversal curse is partially a result of specific model training objectives, particularly evident in the prevalent use of the next-token prediction within most causal language models. For the next-token prediction, models solely focus on a token's preceding context, resulting in a restricted comprehension of the input. In contrast, we illustrate that the GLM, trained using the autoregressive blank infilling objective where tokens to be predicted have access to the entire context, exhibits better resilience against the reversal curse. We propose a novel training method, BIdirectional Casual language modeling Optimization (BICO), designed to mitigate the reversal curse when fine-tuning pretrained causal language models on new data. BICO modifies the causal attention mechanism to function bidirectionally and employs a mask denoising optimization. In the task designed to assess the reversal curse, our approach improves Llama's accuracy from the original 0% to around 70%. We hope that more attention can be focused on exploring and addressing these inherent weaknesses of the current LLMs, in order to achieve a higher level of intelligence.
    Towards AI-controlled FES-restoration of movements: Learning cycling stimulation pattern with reinforcement learning. (arXiv:2303.09986v3 [cs.RO] UPDATED)
    Functional electrical stimulation (FES) has been increasingly integrated with other rehabilitation devices, including robots. FES cycling is one of the common FES applications in rehabilitation, which is performed by stimulating leg muscles in a certain pattern. The appropriate pattern varies across individuals and requires manual tuning which can be time-consuming and challenging for the individual user. Here, we present an AI-based method for finding the patterns, which requires no extra hardware or sensors. Our method has two phases, starting with finding model-based patterns using reinforcement learning and detailed musculoskeletal models. The models, built using open-source software, can be customised through our automated script and can be therefore used by non-technical individuals without extra cost. Next, our method fine-tunes the pattern using real cycling data. We test our both in simulation and experimentally on a stationary tricycle. In the simulation test, our method can robustly deliver model-based patterns for different cycling configurations. The experimental evaluation shows that our method can find a model-based pattern that induces higher cycling speed than an EMG-based pattern. By using just 100 seconds of cycling data, our method can deliver a fine-tuned pattern that gives better cycling performance. Beyond FES cycling, this work is a showcase, displaying the feasibility and potential of human-in-the-loop AI in real-world rehabilitation.
    Private estimation algorithms for stochastic block models and mixture models. (arXiv:2301.04822v2 [cs.DS] UPDATED)
    We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an $(\epsilon, \delta)$-differentially private algorithm that recovers the centers of the $k$-mixture when the minimum separation is at least $ O(k^{1/t}\sqrt{t})$. For all choices of $t$, this algorithm requires sample complexity $n\geq k^{O(1)}d^{O(t)}$ and time complexity $(nd)^{O(t)}$. Prior work required minimum separation at least $O(\sqrt{k})$ as well as an explicit upper bound on the Euclidean norm of the centers.
    Hijacking Large Language Models via Adversarial In-Context Learning. (arXiv:2311.09948v1 [cs.LG])
    In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific tasks by utilizing labeled examples as demonstrations in the precondition prompts. Despite its promising performance, ICL suffers from instability with the choice and arrangement of examples. Additionally, crafted adversarial attacks pose a notable threat to the robustness of ICL. However, existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL. To address these issues, this work introduces a novel transferable attack for ICL, aiming to hijack LLMs to generate the targeted response. The proposed LLM hijacking attack leverages a gradient-based prompt search method to learn and append imperceptible adversarial suffixes to the in-context demonstrations. Extensive experimental results on various tasks and datasets demonstrate the effectiveness of our LLM hijacking attack, resulting in a distracted attention towards adversarial tokens, consequently leading to the targeted unwanted outputs.
    Learning to Reconstruct Signals From Binary Measurements. (arXiv:2303.08691v3 [eess.SP] UPDATED)
    Recent advances in unsupervised learning have highlighted the possibility of learning to reconstruct signals from noisy and incomplete linear measurements alone. These methods play a key role in medical and scientific imaging and sensing, where ground truth data is often scarce or difficult to obtain. However, in practice, measurements are not only noisy and incomplete but also quantized. Here we explore the extreme case of learning from binary observations and provide necessary and sufficient conditions on the number of measurements required for identifying a set of signals from incomplete binary data. Our results are complementary to existing bounds on signal recovery from binary measurements. Furthermore, we introduce a novel self-supervised learning approach, which we name SSBM, that only requires binary data for training. We demonstrate in a series of experiments with real datasets that SSBM performs on par with supervised learning and outperforms sparse reconstruction methods with a fixed wavelet basis by a large margin.
    SABAF: Removing Strong Attribute Bias from Neural Networks with Adversarial Filtering. (arXiv:2311.07141v2 [cs.LG] UPDATED)
    Ensuring a neural network is not relying on protected attributes (e.g., race, sex, age) for prediction is crucial in advancing fair and trustworthy AI. While several promising methods for removing attribute bias in neural networks have been proposed, their limitations remain under-explored. To that end, in this work, we mathematically and empirically reveal the limitation of existing attribute bias removal methods in presence of strong bias and propose a new method that can mitigate this limitation. Specifically, we first derive a general non-vacuous information-theoretical upper bound on the performance of any attribute bias removal method in terms of the bias strength, revealing that they are effective only when the inherent bias in the dataset is relatively weak. Next, we derive a necessary condition for the existence of any method that can remove attribute bias regardless of the bias strength. Inspired by this condition, we then propose a new method using an adversarial objective that directly filters out protected attributes in the input space while maximally preserving all other attributes, without requiring any specific target label. The proposed method achieves state-of-the-art performance in both strong and moderate bias settings. We provide extensive experiments on synthetic, image, and census datasets, to verify the derived theoretical bound and its consequences in practice, and evaluate the effectiveness of the proposed method in removing strong attribute bias.
    Having Beer after Prayer? Measuring Cultural Bias in Large Language Models. (arXiv:2305.14456v2 [cs.CL] UPDATED)
    It is important that language models appropriately adapt to specific cultural contexts. However, as we show in this paper, multilingual and Arabic monolingual language models default to Western culture even when prompted in Arabic and contextualized by an Arab cultural setting. To measure this Western bias, we introduce CAMeL, a dataset of naturally occurring Arabic prompts spanning eight diverse cultural aspects and an extensive list of 20,504 cultural targets corresponding to Arab or Western culture. Using CAMeL, we show that models favor Western targets and demonstrate cultural unfairness on downstream tasks such as named entity recognition and sentiment analysis. Our analyses of pretraining corpora also reveal that commonly used sources such as Wikipedia may not be suited to build culturally aware models, underscoring the importance of carefully curating pretraining data in constructing language models to serve a global population.
    HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM. (arXiv:2311.09528v1 [cs.CL])
    Existing open-source helpfulness preference datasets do not specify what makes some responses more helpful and others less so. Models trained on these datasets can incidentally learn to model dataset artifacts (e.g. preferring longer but unhelpful responses only due to their length). To alleviate this problem, we collect HelpSteer, a multi-attribute helpfulness dataset annotated for the various aspects that make responses helpful. Specifically, our 37k-sample dataset has annotations for correctness, coherence, complexity, and verbosity in addition to overall helpfulness of responses. Training Llama 2 70B using the HelpSteer dataset with SteerLM technique produces a model that scores 7.54 on MT Bench, which is currently the highest score for open models that do not require training data from more powerful models (e.g. GPT4). We release this dataset with CC-BY-4.0 license at https://huggingface.co/datasets/nvidia/HelpSteer
    Diffusion-Augmented Neural Processes. (arXiv:2311.09848v1 [cs.LG])
    Over the last few years, Neural Processes have become a useful modelling tool in many application areas, such as healthcare and climate sciences, in which data are scarce and prediction uncertainty estimates are indispensable. However, the current state of the art in the field (AR CNPs; Bruinsma et al., 2023) presents a few issues that prevent its widespread deployment. This work proposes an alternative, diffusion-based approach to NPs which, through conditioning on noised datasets, addresses many of these limitations, whilst also exceeding SOTA performance.
    Representational Strengths and Limitations of Transformers. (arXiv:2306.02896v2 [cs.LG] UPDATED)
    Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both positive and negative results on the representation power of attention layers, with a focus on intrinsic complexity parameters such as width, depth, and embedding dimension. On the positive side, we present a sparse averaging task, where recurrent networks and feedforward networks all have complexity scaling polynomially in the input size, whereas transformers scale merely logarithmically in the input size; furthermore, we use the same construction to show the necessity and role of a large embedding dimension in a transformer. On the negative side, we present a triple detection task, where attention layers in turn have complexity scaling linearly in the input size; as this scenario seems rare in practice, we also present natural variants that can be efficiently solved by attention layers. The proof techniques emphasize the value of communication complexity in the analysis of transformers and related models, and the role of sparse averaging as a prototypical attention task, which even finds use in the analysis of triple detection.
    Student of Games: A unified learning algorithm for both perfect and imperfect information games. (arXiv:2112.03178v2 [cs.AI] UPDATED)
    Games have a long history as benchmarks for progress in artificial intelligence. Approaches using search and learning produced strong performance across many perfect information games, and approaches using game-theoretic reasoning and learning demonstrated strong performance for specific imperfect information poker variants. We introduce Student of Games, a general-purpose algorithm that unifies previous approaches, combining guided search, self-play learning, and game-theoretic reasoning. Student of Games achieves strong empirical performance in large perfect and imperfect information games -- an important step towards truly general algorithms for arbitrary environments. We prove that Student of Games is sound, converging to perfect play as available computation and approximation capacity increases. Student of Games reaches strong performance in chess and Go, beats the strongest openly available agent in heads-up no-limit Texas hold'em poker, and defeats the state-of-the-art agent in Scotland Yard, an imperfect information game that illustrates the value of guided search, learning, and game-theoretic reasoning.
    Show Your Work with Confidence: Confidence Bands for Tuning Curves. (arXiv:2311.09480v1 [cs.CL])
    The choice of hyperparameters greatly impacts performance in natural language processing. Often, it is hard to tell if a method is better than another or just better tuned. Tuning curves fix this ambiguity by accounting for tuning effort. Specifically, they plot validation performance as a function of the number of hyperparameter choices tried so far. While several estimators exist for these curves, it is common to use point estimates, which we show fail silently and give contradictory results when given too little data. Beyond point estimates, confidence bands are necessary to rigorously establish the relationship between different approaches. We present the first method to construct valid confidence bands for tuning curves. The bands are exact, simultaneous, and distribution-free, thus they provide a robust basis for comparing methods. Empirical analysis shows that while bootstrap confidence bands, which serve as a baseline, fail to approximate their target confidence, ours achieve it exactly. We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method. To promote confident comparisons in future work, we release a library implementing the method at https://github.com/nalourie/opda .
    Rethinking Fano's Inequality in Ensemble Learning. (arXiv:2205.12683v2 [cs.LG] UPDATED)
    We propose a fundamental theory on ensemble learning that answers the central question: what factors make an ensemble system good or bad? Previous studies used a variant of Fano's inequality of information theory and derived a lower bound of the classification error rate on the basis of the $\textit{accuracy}$ and $\textit{diversity}$ of models. We revisit the original Fano's inequality and argue that the studies did not take into account the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate the information loss, which we name $\textit{combination loss}$. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each metric, which will push the theoretical understanding of ensemble learning and give us insights into designing systems.
    EvoPrompting: Language Models for Code-Level Neural Architecture Search. (arXiv:2302.14838v3 [cs.NE] UPDATED)
    Given the recent impressive accomplishments of language models (LMs) for code generation, we explore the use of LMs as adaptive mutation and crossover operators for an evolutionary neural architecture search (NAS) algorithm. While NAS still proves too difficult a task for LMs to succeed at solely through prompting, we find that the combination of evolutionary prompt engineering with soft prompt-tuning, a method we term EvoPrompting, consistently finds diverse and high performing models. We first demonstrate that EvoPrompting is effective on the computationally efficient MNIST-1D dataset, where EvoPrompting produces convolutional architecture variants that outperform both those designed by human experts and naive few-shot prompting in terms of accuracy and model size. We then apply our method to searching for graph neural networks on the CLRS Algorithmic Reasoning Benchmark, where EvoPrompting is able to design novel architectures that outperform current state-of-the-art models on 21 out of 30 algorithmic reasoning tasks while maintaining similar model size. EvoPrompting is successful at designing accurate and efficient neural network architectures across a variety of machine learning tasks, while also being general enough for easy adaptation to other tasks beyond neural network design.
    ForkMerge: Mitigating Negative Transfer in Auxiliary-Task Learning. (arXiv:2301.12618v3 [cs.LG] UPDATED)
    Auxiliary-Task Learning (ATL) aims to improve the performance of the target task by leveraging the knowledge obtained from related tasks. Occasionally, learning multiple tasks simultaneously results in lower accuracy than learning only the target task, which is known as negative transfer. This problem is often attributed to the gradient conflicts among tasks, and is frequently tackled by coordinating the task gradients in previous works. However, these optimization-based methods largely overlook the auxiliary-target generalization capability. To better understand the root cause of negative transfer, we experimentally investigate it from both optimization and generalization perspectives. Based on our findings, we introduce ForkMerge, a novel approach that periodically forks the model into multiple branches, automatically searches the varying task weights by minimizing target validation errors, and dynamically merges all branches to filter out detrimental task-parameter updates. On a series of auxiliary-task learning benchmarks, ForkMerge outperforms existing methods and effectively mitigates negative transfer.
    Heterogeneous Graph Neural Networks using Self-supervised Reciprocally Contrastive Learning. (arXiv:2205.00256v2 [cs.LG] UPDATED)
    Heterogeneous graph neural network (HGNN) is a very popular technique for the modeling and analysis of heterogeneous graphs. Most existing HGNN-based approaches are supervised or semi-supervised learning methods requiring graphs to be annotated, which is costly and time-consuming. Self-supervised contrastive learning has been proposed to address the problem of requiring annotated data by mining intrinsic information hidden within the given data. However, the existing contrastive learning methods are inadequate for heterogeneous graphs because they construct contrastive views only based on data perturbation or pre-defined structural properties (e.g., meta-path) in graph data while ignore the noises that may exist in both node attributes and graph topologies. We develop for the first time a novel and robust heterogeneous graph contrastive learning approach, namely HGCL, which introduces two views on respective guidance of node attributes and graph topologies and integrates and enhances them by reciprocally contrastive mechanism to better model heterogeneous graphs. In this new approach, we adopt distinct but most suitable attribute and topology fusion mechanisms in the two views, which are conducive to mining relevant information in attributes and topologies separately. We further use both attribute similarity and topological correlation to construct high-quality contrastive samples. Extensive experiments on three large real-world heterogeneous graphs demonstrate the superiority and robustness of HGCL over state-of-the-art methods.
    Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning. (arXiv:2205.07394v2 [cs.AR] UPDATED)
    Hybrid storage systems (HSS) use multiple different storage devices to provide high and scalable storage capacity at high performance. Recent research proposes various techniques that aim to accurately identify performance-critical data to place it in a "best-fit" storage device. Unfortunately, most of these techniques are rigid, which (1) limits their adaptivity to perform well for a wide range of workloads and storage device configurations, and (2) makes it difficult for designers to extend these techniques to different storage system configurations (e.g., with a different number or different types of storage devices) than the configuration they are designed for. We introduce Sibyl, the first technique that uses reinforcement learning for data placement in hybrid storage systems. Sibyl observes different features of the running workload as well as the storage devices to make system-aware data placement decisions. For every decision it makes, Sibyl receives a reward from the system that it uses to evaluate the long-term performance impact of its decision and continuously optimizes its data placement policy online. We implement Sibyl on real systems with various HSS configurations. Our results show that Sibyl provides 21.6%/19.9% performance improvement in a performance-oriented/cost-oriented HSS configuration compared to the best previous data placement technique. Our evaluation using an HSS configuration with three different storage devices shows that Sibyl outperforms the state-of-the-art data placement policy by 23.9%-48.2%, while significantly reducing the system architect's burden in designing a data placement mechanism that can simultaneously incorporate three storage devices. We show that Sibyl achieves 80% of the performance of an oracle policy that has complete knowledge of future access patterns while incurring a very modest storage overhead of only 124.4 KiB.
    Reframing Audience Expansion through the Lens of Probability Density Estimation. (arXiv:2311.05853v1 [cs.AI] CROSS LISTED)
    Audience expansion has become an important element of prospective marketing, helping marketers create target audiences based on a mere representative sample of their current customer base. Within the realm of machine learning, a favored algorithm for scaling this sample into a broader audience hinges on a binary classification task, with class probability estimates playing a crucial role. In this paper, we review this technique and introduce a key change in how we choose training examples to ensure the quality of the generated audience. We present a simulation study based on the widely used MNIST dataset, where consistent high precision and recall values demonstrate our approach's ability to identify the most relevant users for an expanded audience. Our results are easily reproducible and a Python implementation is openly available on GitHub: \url{https://github.com/carvalhaes-ai/audience-expansion}
    Loss Modeling for Multi-Annotator Datasets. (arXiv:2311.00619v2 [cs.LG] UPDATED)
    Accounting for the opinions of all annotators of a dataset is critical for fairness. However, when annotating large datasets, individual annotators will frequently provide thousands of ratings which can lead to fatigue. Additionally, these annotation processes can occur over multiple days which can lead to an inaccurate representation of an annotator's opinion over time. To combat this, we propose to learn a more accurate representation of diverse opinions by utilizing multitask learning in conjunction with loss-based label correction. We show that using our novel formulation, we can cleanly separate agreeing and disagreeing annotations. Furthermore, we demonstrate that this modification can improve prediction performance in a single or multi-annotator setting. Lastly, we show that this method remains robust to additional label noise that is applied to subjective data.
    Tied-Lora: Enhacing parameter efficiency of LoRA with weight tying. (arXiv:2311.09578v1 [cs.CL])
    We propose Tied-LoRA, a simple paradigm utilizes weight tying and selective training to further increase parameter efficiency of the Low-rank adaptation (LoRA) method. Our investigations include all feasible combinations parameter training/freezing in conjunction with weight tying to identify the optimal balance between performance and the number of trainable parameters. Through experiments covering a variety of tasks and two base language models, we provide analysis revealing trade-offs between efficiency and performance. Our experiments uncovered a particular Tied-LoRA configuration that stands out by demonstrating comparable performance across several tasks while employing only 13~\% percent of parameters utilized by the standard LoRA method.
    Investigating the Impact of Weight Sharing Decisions on Knowledge Transfer in Continual Learning. (arXiv:2311.09506v1 [cs.LG])
    Continual Learning (CL) has generated attention as a method of avoiding Catastrophic Forgetting (CF) in the sequential training of neural networks, improving network efficiency and adaptability to different tasks. Additionally, CL serves as an ideal setting for studying network behavior and Forward Knowledge Transfer (FKT) between tasks. Pruning methods for CL train subnetworks to handle the sequential tasks which allows us to take a structured approach to investigating FKT. Sharing prior subnetworks' weights leverages past knowledge for the current task through FKT. Understanding which weights to share is important as sharing all weights can yield sub-optimal accuracy. This paper investigates how different sharing decisions affect the FKT between tasks. Through this lens we demonstrate how task complexity and similarity influence the optimal weight sharing decisions, giving insights into the relationships between tasks and helping inform decision making in similar CL methods. We implement three sequential datasets designed to emphasize variation in task complexity and similarity, reporting results for both ResNet-18 and VGG-16. By sharing in accordance with the decisions supported by our findings, we show that we can improve task accuracy compared to other sharing decisions.
    Approximation Theory, Computing, and Deep Learning on the Wasserstein Space. (arXiv:2310.19548v2 [math.OC] UPDATED)
    The challenge of approximating functions in infinite-dimensional spaces from finite samples is widely regarded as formidable. In this study, we delve into the challenging problem of the numerical approximation of Sobolev-smooth functions defined on probability spaces. Our particular focus centers on the Wasserstein distance function, which serves as a relevant example. In contrast to the existing body of literature focused on approximating efficiently pointwise evaluations, we chart a new course to define functional approximants by adopting three machine learning-based approaches: 1. Solving a finite number of optimal transport problems and computing the corresponding Wasserstein potentials. 2. Employing empirical risk minimization with Tikhonov regularization in Wasserstein Sobolev spaces. 3. Addressing the problem through the saddle point formulation that characterizes the weak form of the Tikhonov functional's Euler-Lagrange equation. As a theoretical contribution, we furnish explicit and quantitative bounds on generalization errors for each of these solutions. In the proofs, we leverage the theory of metric Sobolev spaces and we combine it with techniques of optimal transport, variational calculus, and large deviation bounds. In our numerical implementation, we harness appropriately designed neural networks to serve as basis functions. These networks undergo training using diverse methodologies. This approach allows us to obtain approximating functions that can be rapidly evaluated after training. Consequently, our constructive solutions significantly enhance at equal accuracy the evaluation speed, surpassing that of state-of-the-art methods by several orders of magnitude.
    UMD: Unsupervised Model Detection for X2X Backdoor Attacks. (arXiv:2305.18651v4 [cs.LG] UPDATED)
    Backdoor (Trojan) attack is a common threat to deep neural networks, where samples from one or more source classes embedded with a backdoor trigger will be misclassified to adversarial target classes. Existing methods for detecting whether a classifier is backdoor attacked are mostly designed for attacks with a single adversarial target (e.g., all-to-one attack). To the best of our knowledge, without supervision, no existing methods can effectively address the more general X2X attack with an arbitrary number of source classes, each paired with an arbitrary target class. In this paper, we propose UMD, the first Unsupervised Model Detection method that effectively detects X2X backdoor attacks via a joint inference of the adversarial (source, target) class pairs. In particular, we first define a novel transferability statistic to measure and select a subset of putative backdoor class pairs based on a proposed clustering approach. Then, these selected class pairs are jointly assessed based on an aggregation of their reverse-engineered trigger size for detection inference, using a robust and unsupervised anomaly detector we proposed. We conduct comprehensive evaluations on CIFAR-10, GTSRB, and Imagenette dataset, and show that our unsupervised UMD outperforms SOTA detectors (even with supervision) by 17%, 4%, and 8%, respectively, in terms of the detection accuracy against diverse X2X attacks. We also show the strong detection performance of UMD against several strong adaptive attacks.
    MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification. (arXiv:2311.09761v1 [cs.CL])
    Fallacies can be used to spread disinformation, fake news, and propaganda, underlining the importance of their detection. Automated detection and classification of fallacies, however, remain challenging, mainly because of the innate subjectivity of the task and the need for a comprehensive, unified approach in existing research. Addressing these limitations, our study introduces a novel taxonomy of fallacies that aligns and refines previous classifications, a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity, adapted to precision, recall, and F1-Score metrics. Using our annotation scheme, the paper introduces MAFALDA (Multi-level Annotated FALlacy DAtaset), a gold standard dataset. MAFALDA is based on examples from various previously existing fallacy datasets under our unified taxonomy across three levels of granularity. We then evaluate several language models under a zero-shot learning setting using MAFALDA to assess their fallacy detection and classification capability. Our comprehensive evaluation not only benchmarks the performance of these models but also provides valuable insights into their strengths and limitations in addressing fallacious reasoning.
    Comparing Differentiable Logics for Learning Systems: A Research Preview. (arXiv:2311.09809v1 [cs.LO])
    Extensive research on formal verification of machine learning (ML) systems indicates that learning from data alone often fails to capture underlying background knowledge. A variety of verifiers have been developed to ensure that a machine-learnt model satisfies correctness and safety properties, however, these verifiers typically assume a trained network with fixed weights. ML-enabled autonomous systems are required to not only detect incorrect predictions, but should also possess the ability to self-correct, continuously improving and adapting. A promising approach for creating ML models that inherently satisfy constraints is to encode background knowledge as logical constraints that guide the learning process via so-called differentiable logics. In this research preview, we compare and evaluate various logics from the literature in weakly-supervised contexts, presenting our findings and highlighting open problems for future work. Our experimental results are broadly consistent with results reported previously in literature; however, learning with differentiable logics introduces a new hyperparameter that is difficult to tune and has significant influence on the effectiveness of the logics.
    On the Intrinsic Structures of Spiking Neural Networks. (arXiv:2207.04876v3 [cs.NE] UPDATED)
    Recent years have emerged a surge of interest in SNNs owing to their remarkable potential to handle time-dependent and event-driven data. The performance of SNNs hinges not only on selecting an apposite architecture and fine-tuning connection weights, similar to conventional ANNs, but also on the meticulous configuration of intrinsic structures within spiking computations. However, there has been a dearth of comprehensive studies examining the impact of intrinsic structures. Consequently, developers often find it challenging to apply a standardized configuration of SNNs across diverse datasets or tasks. This work delves deep into the intrinsic structures of SNNs. Initially, we unveil two pivotal components of intrinsic structures: the integration operation and firing-reset mechanism, by elucidating their influence on the expressivity of SNNs. Furthermore, we draw two key conclusions: the membrane time hyper-parameter is intimately linked to the eigenvalues of the integration operation, dictating the functional topology of spiking dynamics, and various hyper-parameters of the firing-reset mechanism govern the overall firing capacity of an SNN, mitigating the injection ratio or sampling density of input data. These findings elucidate why the efficacy of SNNs hinges heavily on the configuration of intrinsic structures and lead to a recommendation that enhancing the adaptability of these structures contributes to improving the overall performance and applicability of SNNs. Inspired by this recognition, we propose two feasible approaches to enhance SNN learning. These involve leveraging self-connection architectures and employing stochastic spiking neurons to augment the adaptability of the integration operation and firing-reset mechanism, respectively. We verify the effectiveness of the proposed methods from perspectives of theory and practice.
    Self-supervised learning of multi-omics embeddings in the low-label, high-data regime. (arXiv:2311.09962v1 [cs.LG])
    Contrastive, self-supervised learning (SSL) is used to train a model that predicts cancer type from miRNA, mRNA or RPPA expression data. This model, a pretrained FT-Transformer, is shown to outperform XGBoost and CatBoost, standard benchmarks for tabular data, when labelled samples are scarce but the number of unlabelled samples is high. This is despite the fact that the datasets we use have $\mathcal{O}(10^{1})$ classes and $\mathcal{O}(10^{2})-\mathcal{O}(10^{4})$ features. After demonstrating the efficacy of our chosen method of self-supervised pretraining, we investigate SSL for multi-modal models. A late-fusion model is proposed, where each omics is passed through its own sub-network, the outputs of which are averaged and passed to the pretraining or downstream objective function. Multi-modal pretraining is shown to improve predictions from a single omics, and we argue that this is useful for datasets with many unlabelled multi-modal samples, but few labelled unimodal samples. Additionally, we show that pretraining each omics-specific module individually is highly effective. This enables the application of the proposed model in a variety of contexts where a large amount of unlabelled data is available from each omics, but only a few labelled samples.
    SurgPLAN: Surgical Phase Localization Network for Phase Recognition. (arXiv:2311.09965v1 [cs.CV])
    Surgical phase recognition is crucial to providing surgery understanding in smart operating rooms. Despite great progress in automatic surgical phase recognition, most existing methods are still restricted by two problems. First, these methods cannot capture discriminative visual features for each frame and motion information with simple 2D networks. Second, the frame-by-frame recognition paradigm degrades the performance due to unstable predictions within each phase, termed as phase shaking. To address these two challenges, we propose a Surgical Phase LocAlization Network, named SurgPLAN, to facilitate a more accurate and stable surgical phase recognition with the principle of temporal detection. Specifically, we first devise a Pyramid SlowFast (PSF) architecture to serve as the visual backbone to capture multi-scale spatial and temporal features by two branches with different frame sampling rates. Moreover, we propose a Temporal Phase Localization (TPL) module to generate the phase prediction based on temporal region proposals, which ensures accurate and consistent predictions within each surgical phase. Extensive experiments confirm the significant advantages of our SurgPLAN over frame-by-frame approaches in terms of both accuracy and stability.
    Score-based generative models learn manifold-like structures with constrained mixing. (arXiv:2311.09952v1 [stat.ML])
    How do score-based generative models (SBMs) learn the data distribution supported on a low-dimensional manifold? We investigate the score model of a trained SBM through its linear approximations and subspaces spanned by local feature vectors. During diffusion as the noise decreases, the local dimensionality increases and becomes more varied between different sample sequences. Importantly, we find that the learned vector field mixes samples by a non-conservative field within the manifold, although it denoises with normal projections as if there is an energy function in off-manifold directions. At each noise level, the subspace spanned by the local features overlap with an effective density function. These observations suggest that SBMs can flexibly mix samples with the learned score field while carefully maintaining a manifold-like structure of the data distribution.
    Comprehensive Evaluation and Insights into the Use of Deep Neural Networks to Detect and Quantify Lymphoma Lesions in PET/CT Images. (arXiv:2311.09614v1 [cs.CV])
    This study performs comprehensive evaluation of four neural network architectures (UNet, SegResNet, DynUNet, and SwinUNETR) for lymphoma lesion segmentation from PET/CT images. These networks were trained, validated, and tested on a diverse, multi-institutional dataset of 611 cases. Internal testing (88 cases; total metabolic tumor volume (TMTV) range [0.52, 2300] ml) showed SegResNet as the top performer with a median Dice similarity coefficient (DSC) of 0.76 and median false positive volume (FPV) of 4.55 ml; all networks had a median false negative volume (FNV) of 0 ml. On the unseen external test set (145 cases with TMTV range: [0.10, 2480] ml), SegResNet achieved the best median DSC of 0.68 and FPV of 21.46 ml, while UNet had the best FNV of 0.41 ml. We assessed reproducibility of six lesion measures, calculated their prediction errors, and examined DSC performance in relation to these lesion measures, offering insights into segmentation accuracy and clinical relevance. Additionally, we introduced three lesion detection criteria, addressing the clinical need for identifying lesions, counting them, and segmenting based on metabolic characteristics. We also performed expert intra-observer variability analysis revealing the challenges in segmenting ``easy'' vs. ``hard'' cases, to assist in the development of more resilient segmentation algorithms. Finally, we performed inter-observer agreement assessment underscoring the importance of a standardized ground truth segmentation protocol involving multiple expert annotators. Code is available at: https://github.com/microsoft/lymphoma-segmentation-dnn
    Prudent Silence or Foolish Babble? Examining Large Language Models' Responses to the Unknown. (arXiv:2311.09731v1 [cs.CL])
    Large Language Models (LLMs) often struggle when faced with situations where they lack the prerequisite knowledge to generate a sensical response. In these cases, models tend to fabricate and hallucinate, rather than appropriately signaling uncertainty as humans would. This behavior misaligns with human conversational norms and presents challenges surrounding responsible and ethical AI development. This work aims to systematically investigate LLMs' behaviors in such situations. We curate an adversarial question-answering benchmark containing unanswerable questions targeting information absent from the LLM's training data. Concretely, these unanswerable questions contain non-existent concepts or false premises. When presented with such unanswerable questions, an LLM should appropriately convey uncertainty, and be able to challenge the premise and refuse to generate a response. While facing answerable valid questions, a model should demonstrate a positive correlation between accuracy and confidence. Using a model-agnostic unified confidence elicitation approach, we observe that LLMs that have gone through instruction finetuning and reinforcement learning from human feedback (RLHF) perform significantly better than their counterparts that do not. Moreover, uncertainty expression 1 through our elicitation method does not always stay consistent with the perceived confidence of the direct response of an LLM. Our findings call for further research into teaching LLMs to proactively and reliably express uncertainty.
    Fair Data Representation for Machine Learning at the Pareto Frontier. (arXiv:2201.00292v3 [stat.ML] UPDATED)
    As machine learning powered decision-making becomes increasingly important in our daily lives, it is imperative to strive for fairness in the underlying data processing. We propose a pre-processing algorithm for fair data representation via which supervised learning results in estimations of the Pareto frontier between prediction error and statistical disparity. Particularly, the present work applies the optimal affine transport to approach the post-processing Wasserstein-2 barycenter characterization of the optimal fair $L^2$-objective supervised learning via a pre-processing data deformation. Furthermore, we show that the Wasserstein-2 geodesics from the conditional (on sensitive information) distributions of the learning outcome to their barycenter characterizes the Pareto frontier between $L^2$-loss and the average pairwise Wasserstein-2 distance among sensitive groups on the learning outcome. Numerical simulations underscore the advantages: (1) the pre-processing step is compositive with arbitrary conditional expectation estimation supervised learning methods and unseen data; (2) the fair representation protects the sensitive information by limiting the inference capability of the remaining data with respect to the sensitive data; (3) the optimal affine maps are computationally efficient even for high-dimensional data.
    Residual-Based Error Corrector Operator to Enhance Accuracy and Reliability of Neural Operator Surrogates of Nonlinear Variational Boundary-Value Problems. (arXiv:2306.12047v3 [math.NA] UPDATED)
    This work focuses on developing methods for approximating the solution operators of a class of parametric partial differential equations via neural operators. Neural operators have several challenges, including the issue of generating appropriate training data, cost-accuracy trade-offs, and nontrivial hyperparameter tuning. The unpredictability of the accuracy of neural operators impacts their applications in downstream problems of inference, optimization, and control. A framework based on the linear variational problem that gives the correction to the prediction furnished by neural operators is considered based on earlier work in JCP 486 (2023) 112104. The operator, called Residual-based Error Corrector Operator or simply Corrector Operator, associated with the corrector problem is analyzed further. Numerical results involving a nonlinear reaction-diffusion model in two dimensions with PCANet-type neural operators show almost two orders of increase in the accuracy of approximations when neural operators are corrected using the correction scheme. Further, topology optimization involving a nonlinear reaction-diffusion model is considered to highlight the limitations of neural operators and the efficacy of the correction scheme. Optimizers with neural operator surrogates are seen to make significant errors (as high as 80 percent). However, the errors are much lower (below 7 percent) when neural operators are corrected.
    Overcoming Data Scarcity in Biomedical Imaging with a Foundational Multi-Task Model. (arXiv:2311.09847v1 [cs.CV])
    Foundational models, pretrained on a large scale, have demonstrated substantial success across non-medical domains. However, training these models typically requires large, comprehensive datasets, which contrasts with the smaller and more heterogeneous datasets common in biomedical imaging. Here, we propose a multi-task learning strategy that decouples the number of training tasks from memory requirements. We trained a Universal bioMedical PreTrained model (UMedPT) on a multi-task database including tomographic, microscopic, and X-ray images, with various labelling strategies such as classification, segmentation, and object detection. The UMedPT foundational model outperformed ImageNet pretraining and the previous state-of-the-art models. For tasks related to the pretraining database, it maintained its performance with only 1% of the original training data and without fine-tuning. For out-of-domain tasks it required not more than 50% of the original training data. In an external independent validation imaging features extracted using UMedPT proved to be a new standard for cross-center transferability.
    Gradients Look Alike: Sensitivity is Often Overestimated in DP-SGD. (arXiv:2307.00310v2 [cs.LG] UPDATED)
    Differentially private stochastic gradient descent (DP-SGD) is the canonical approach to private deep learning. While the current privacy analysis of DP-SGD is known to be tight in some settings, several empirical results suggest that models trained on common benchmark datasets leak significantly less privacy for many datapoints. Yet, despite past attempts, a rigorous explanation for why this is the case has not been reached. Is it because there exist tighter privacy upper bounds when restricted to these dataset settings, or are our attacks not strong enough for certain datapoints? In this paper, we provide the first per-instance (i.e., ``data-dependent") DP analysis of DP-SGD. Our analysis captures the intuition that points with similar neighbors in the dataset enjoy better data-dependent privacy than outliers. Formally, this is done by modifying the per-step privacy analysis of DP-SGD to introduce a dependence on the distribution of model updates computed from a training dataset. We further develop a new composition theorem to effectively use this new per-step analysis to reason about an entire training run. Put all together, our evaluation shows that this novel DP-SGD analysis allows us to now formally show that DP-SGD leaks significantly less privacy for many datapoints (when trained on common benchmarks) than the current data-independent guarantee. This implies privacy attacks will necessarily fail against many datapoints if the adversary does not have sufficient control over the possible training datasets.
    A Framework for Monitoring and Retraining Language Models in Real-World Applications. (arXiv:2311.09930v1 [cs.LG])
    In the Machine Learning (ML) model development lifecycle, training candidate models using an offline holdout dataset and identifying the best model for the given task is only the first step. After the deployment of the selected model, continuous model monitoring and model retraining is required in many real-world applications. There are multiple reasons for retraining, including data or concept drift, which may be reflected on the model performance as monitored by an appropriate metric. Another motivation for retraining is the acquisition of increasing amounts of data over time, which may be used to retrain and improve the model performance even in the absence of drifts. We examine the impact of various retraining decision points on crucial factors, such as model performance and resource utilization, in the context of Multilabel Classification models. We explain our key decision points and propose a reference framework for designing an effective model retraining strategy.
    Breaking Boundaries: Balancing Performance and Robustness in Deep Wireless Traffic Forecasting. (arXiv:2311.09790v1 [cs.LG])
    Balancing the trade-off between accuracy and robustness is a long-standing challenge in time series forecasting. While most of existing robust algorithms have achieved certain suboptimal performance on clean data, sustaining the same performance level in the presence of data perturbations remains extremely hard. % In this paper, we study a wide array of perturbation scenarios and propose novel defense mechanisms against adversarial attacks using real-world telecom data. We compare our strategy against two existing adversarial training algorithms under a range of maximal allowed perturbations, defined using $\ell_{\infty}$-norm, $\in [0.1,0.4]$. % Our findings reveal that our hybrid strategy, which is composed of a classifier to detect adversarial examples, a denoiser to eliminate noise from the perturbed data samples, and a standard forecaster, achieves the best performance on both clean and perturbed data. % Our optimal model can retain up to $92.02\%$ the performance of the original forecasting model in terms of Mean Squared Error (MSE) on clean data, while being more robust than the standard adversarially trained models on perturbed data. Its MSE is 2.71$\times$ and 2.51$\times$ lower than those of comparing methods on normal and perturbed data, respectively. In addition, the components of our models can be trained in parallel, resulting in better computational efficiency. % Our results indicate that we can optimally balance the trade-off between the performance and robustness of forecasting models by improving the classifier and denoiser, even in the presence of sophisticated and destructive poisoning attacks.
    GAIA: Delving into Gradient-based Attribution Abnormality for Out-of-distribution Detection. (arXiv:2311.09620v1 [cs.LG])
    Detecting out-of-distribution (OOD) examples is crucial to guarantee the reliability and safety of deep neural networks in real-world settings. In this paper, we offer an innovative perspective on quantifying the disparities between in-distribution (ID) and OOD data -- analyzing the uncertainty that arises when models attempt to explain their predictive decisions. This perspective is motivated by our observation that gradient-based attribution methods encounter challenges in assigning feature importance to OOD data, thereby yielding divergent explanation patterns. Consequently, we investigate how attribution gradients lead to uncertain explanation outcomes and introduce two forms of abnormalities for OOD detection: the zero-deflation abnormality and the channel-wise average abnormality. We then propose GAIA, a simple and effective approach that incorporates Gradient Abnormality Inspection and Aggregation. The effectiveness of GAIA is validated on both commonly utilized (CIFAR) and large-scale (ImageNet-1k) benchmarks. Specifically, GAIA reduces the average FPR95 by 23.10% on CIFAR10 and by 45.41% on CIFAR100 compared to advanced post-hoc methods.
    Aligning with Whom? Large Language Models Have Gender and Racial Biases in Subjective NLP Tasks. (arXiv:2311.09730v1 [cs.CL])
    Human perception of language depends on personal backgrounds like gender and ethnicity. While existing studies have shown that large language models (LLMs) hold values that are closer to certain societal groups, it is unclear whether their prediction behaviors on subjective NLP tasks also exhibit a similar bias. In this study, leveraging the POPQUORN dataset which contains annotations of diverse demographic backgrounds, we conduct a series of experiments on four popular LLMs to investigate their capability to understand group differences and potential biases in their predictions for politeness and offensiveness. We find that for both tasks, model predictions are closer to the labels from White and female participants. We further explore prompting with the target demographic labels and show that including the target demographic in the prompt actually worsens the model's performance. More specifically, when being prompted to respond from the perspective of "Black" and "Asian" individuals, models show lower performance in predicting both overall scores as well as the scores from corresponding groups. Our results suggest that LLMs hold gender and racial biases for subjective NLP tasks and that demographic-infused prompts alone may be insufficient to mitigate such effects. Code and data are available at https://github.com/Jiaxin-Pei/LLM-Group-Bias.
    Modelling daily mobility using mobile data traffic at fine spatiotemporal scale. (arXiv:2311.09683v1 [cs.LG])
    We applied a data-driven approach that explores the usability of the NetMob 2023 dataset in modelling mobility patterns within an urban context. We combined the data with a highly suitable external source, the ENACT dataset, which provides a 1 km x 1km grid with estimates of the day and night population across Europe. We developed three sets of XGBoost models that predict the population in each 100m x 100m grid cell used in NetMob2023 based on the mobile data traffic of the 68 online services covered in the dataset, using the ENACT values as ground truth. The results suggest that the NetMob 2023 data can be useful for the estimation of the day and night population and grid cell level and can explain part of the dynamics of urban mobility.
    AMLB: an AutoML Benchmark. (arXiv:2207.12560v2 [cs.LG] UPDATED)
    Comparing different AutoML frameworks is notoriously challenging and often done incorrectly. We introduce an open and extensible benchmark that follows best practices and avoids common mistakes when comparing AutoML frameworks. We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures. We also use Bradley-Terry trees to discover subsets of tasks where the relative AutoML framework rankings differ. The benchmark comes with an open-source tool that integrates with many AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation. The benchmark uses public data sets, can be easily extended with other AutoML frameworks and tasks, and has a website with up-to-date results.
    Adaptive Interventions with User-Defined Goals for Health Behavior Change. (arXiv:2311.09483v1 [cs.LG])
    Physical inactivity remains a major public health concern, having associations with adverse health outcomes such as cardiovascular disease and type-2 diabetes. Mobile health applications present a promising avenue for low-cost, scalable physical activity promotion, yet often suffer from small effect sizes and low adherence rates, particularly in comparison to human coaching. Goal-setting is a critical component of health coaching that has been underutilized in adaptive algorithms for mobile health interventions. This paper introduces a modification to the Thompson sampling algorithm that places emphasis on individualized goal-setting by optimizing personalized reward functions. As a step towards supporting goal-setting, this paper offers a balanced approach that can leverage shared structure while optimizing individual preferences and goals. We prove that our modification incurs only a constant penalty on the cumulative regret while preserving the sample complexity benefits of data sharing. In a physical activity simulator, we demonstrate that our algorithm achieves substantial improvements in cumulative regret compared to baselines that do not share data or do not optimize for individualized rewards.
    Improving the Generation Quality of Watermarked Large Language Models via Word Importance Scoring. (arXiv:2311.09668v1 [cs.CL])
    The strong general capabilities of Large Language Models (LLMs) bring potential ethical risks if they are unrestrictedly accessible to malicious users. Token-level watermarking inserts watermarks in the generated texts by altering the token probability distributions with a private random number generator seeded by its prefix tokens. However, this watermarking algorithm alters the logits during generation, which can lead to a downgraded text quality if it chooses to promote tokens that are less relevant given the input. In this work, we propose to improve the quality of texts generated by a watermarked language model by Watermarking with Importance Scoring (WIS). At each generation step, we estimate the importance of the token to generate, and prevent it from being impacted by watermarking if it is important for the semantic correctness of the output. We further propose three methods to predict importance scoring, including a perturbation-based method and two model-based methods. Empirical experiments show that our method can generate texts with better quality with comparable level of detection rate.
    FedCode: Communication-Efficient Federated Learning via Transferring Codebooks. (arXiv:2311.09270v1 [cs.LG])
    Federated Learning (FL) is a distributed machine learning paradigm that enables learning models from decentralized local data. While FL offers appealing properties for clients' data privacy, it imposes high communication burdens for exchanging model weights between a server and the clients. Existing approaches rely on model compression techniques, such as pruning and weight clustering to tackle this. However, transmitting the entire set of weight updates at each federated round, even in a compressed format, limits the potential for a substantial reduction in communication volume. We propose FedCode where clients transmit only codebooks, i.e., the cluster centers of updated model weight values. To ensure a smooth learning curve and proper calibration of clusters between the server and the clients, FedCode periodically transfers model weights after multiple rounds of solely communicating codebooks. This results in a significant reduction in communication volume between clients and the server in both directions, without imposing significant computational overhead on the clients or leading to major performance degradation of the models. We evaluate the effectiveness of FedCode using various publicly available datasets with ResNet-20 and MobileNet backbone model architectures. Our evaluations demonstrate a 12.2-fold data transmission reduction on average while maintaining a comparable model performance with an average accuracy loss of 1.3% compared to FedAvg. Further validation of FedCode performance under non-IID data distributions showcased an average accuracy loss of 2.0% compared to FedAvg while achieving approximately a 12.7-fold data transmission reduction.
    GEO: Generative Engine Optimization. (arXiv:2311.09735v1 [cs.LG])
    The advent of large language models (LLMs) has ushered in a new paradigm of search engines that use generative models to gather and summarize information to answer user queries. This emerging technology, which we formalize under the unified framework of Generative Engines (GEs), has the potential to generate accurate and personalized responses, and is rapidly replacing traditional search engines like Google and Bing. Generative Engines typically satisfy queries by synthesizing information from multiple sources and summarizing them with the help of LLMs. While this shift significantly improves \textit{user} utility and \textit{generative search engine} traffic, it results in a huge challenge for the third stakeholder -- website and content creators. Given the black-box and fast-moving nature of Generative Engines, content creators have little to no control over when and how their content is displayed. With generative engines here to stay, the right tools should be provided to ensure that creator economy is not severely disadvantaged. To address this, we introduce Generative Engine Optimization (GEO), a novel paradigm to aid content creators in improving the visibility of their content in Generative Engine responses through a black-box optimization framework for optimizing and defining visibility metrics. We facilitate systematic evaluation in this new paradigm by introducing GEO-bench, a benchmark of diverse user queries across multiple domains, coupled with sources required to answer these queries. Through rigorous evaluation, we show that GEO can boost visibility by up to 40\% in generative engine responses. Moreover, we show the efficacy of these strategies varies across domains, underscoring the need for domain-specific methods. Our work opens a new frontier in the field of information discovery systems, with profound implications for generative engines and content creators.
    Constructing interpretable principal curve using Neural ODEs. (arXiv:2311.09274v1 [cs.LG])
    The study of high dimensional data sets often rely on their low dimensional projections that preserve the local geometry of the original space. While numerous methods have been developed to summarize this space as variations of tree-like structures, they are usually non-parametric and "static" in nature. As data may come from systems that are dynamical such as a differentiating cell, a static, non-parametric characterization of the space may not be the most appropriate. Here, we developed a framework, the principal flow, that is capable of characterizing the space in a dynamical manner. The principal flow, defined using neural ODEs, directs motion of a particle through the space, where the trajectory of the particle resembles the principal curve of the dataset. We illustrate that our framework can be used to characterize shapes of various complexities, and is flexible to incorporate summaries of relaxation dynamics.
    A Comparative Analysis of Machine Learning Models for Early Detection of Hospital-Acquired Infections. (arXiv:2311.09329v1 [cs.LG])
    As more and more infection-specific machine learning models are developed and planned for clinical deployment, simultaneously running predictions from different models may provide overlapping or even conflicting information. It is important to understand the concordance and behavior of parallel models in deployment. In this study, we focus on two models for the early detection of hospital-acquired infections (HAIs): 1) the Infection Risk Index (IRI) and 2) the Ventilator-Associated Pneumonia (VAP) prediction model. The IRI model was built to predict all HAIs, whereas the VAP model identifies patients at risk of developing ventilator-associated pneumonia. These models could make important improvements in patient outcomes and hospital management of infections through early detection of infections and in turn, enable early interventions. The two models vary in terms of infection label definition, cohort selection, and prediction schema. In this work, we present a comparative analysis between the two models to characterize concordances and confusions in predicting HAIs by these models. The learnings from this study will provide important findings for how to deploy multiple concurrent disease-specific models in the future.
    Infinite Width Graph Neural Networks for Node Regression/ Classification. (arXiv:2310.08176v3 [cs.LG] UPDATED)
    This work analyzes Graph Neural Networks, a generalization of Fully-Connected Deep Neural Nets on Graph structured data, when their width, that is the number of nodes in each fullyconnected layer is increasing to infinity. Infinite Width Neural Networks are connecting Deep Learning to Gaussian Processes and Kernels, both Machine Learning Frameworks with long traditions and extensive theoretical foundations. Gaussian Processes and Kernels have much less hyperparameters then Neural Networks and can be used for uncertainty estimation, making them more user friendly for applications. This works extends the increasing amount of research connecting Gaussian Processes and Kernels to Neural Networks. The Kernel and Gaussian Process closed forms are derived for a variety of architectures, namely the standard Graph Neural Network, the Graph Neural Network with Skip-Concatenate Connections and the Graph Attention Neural Network. All architectures are evaluated on a variety of datasets on the task of transductive Node Regression and Classification. Additionally, a Spectral Sparsification method known as Effective Resistance is used to improve runtime and memory requirements. Extending the setting to inductive graph learning tasks (Graph Regression/ Classification) is straightforward and is briefly discussed in 3.5.
    Neural machine translation for automated feedback on children's early-stage writing. (arXiv:2311.09389v1 [cs.CL])
    In this work, we address the problem of assessing and constructing feedback for early-stage writing automatically using machine learning. Early-stage writing is typically vastly different from conventional writing due to phonetic spelling and lack of proper grammar, punctuation, spacing etc. Consequently, early-stage writing is highly non-trivial to analyze using common linguistic metrics. We propose to use sequence-to-sequence models for "translating" early-stage writing by students into "conventional" writing, which allows the translated text to be analyzed using linguistic metrics. Furthermore, we propose a novel robust likelihood to mitigate the effect of noise in the dataset. We investigate the proposed methods using a set of numerical experiments and demonstrate that the conventional text can be predicted with high accuracy.
    Linear time Evidence Accumulation Clustering with KMeans. (arXiv:2311.09272v1 [cs.LG])
    Among ensemble clustering methods, Evidence Accumulation Clustering is one of the simplest technics. In this approach, a co-association (CA) matrix representing the co-clustering frequency is built and then clustered to extract consensus clusters. Compared to other approaches, this one is simple as there is no need to find matches between clusters obtained from two different partitionings. Nevertheless, this method suffers from computational issues, as it requires to compute and store a matrix of size n x n, where n is the number of items. Due to the quadratic cost, this approach is reserved for small datasets. This work describes a trick which mimic the behavior of average linkage clustering. We found a way of computing efficiently the density of a partitioning, reducing the cost from a quadratic to linear complexity. Additionally, we proved that the k-means maximizes naturally the density. We performed experiments on several benchmark datasets where we compared the k-means and the bisecting version to other state-of-the-art consensus algorithms. The k-means results are comparable to the best state of the art in terms of NMI while keeping the computational cost low. Additionally, the k-means led to the best results in terms of density. These results provide evidence that consensus clustering can be solved with simple algorithms.
    Correlation-aware Spatial-Temporal Graph Learning for Multivariate Time-series Anomaly Detection. (arXiv:2307.08390v2 [cs.LG] UPDATED)
    Multivariate time-series anomaly detection is critically important in many applications, including retail, transportation, power grid, and water treatment plants. Existing approaches for this problem mostly employ either statistical models which cannot capture the non-linear relations well or conventional deep learning models (e.g., CNN and LSTM) that do not explicitly learn the pairwise correlations among variables. To overcome these limitations, we propose a novel method, correlation-aware spatial-temporal graph learning (termed CST-GL), for time series anomaly detection. CST-GL explicitly captures the pairwise correlations via a multivariate time series correlation learning module based on which a spatial-temporal graph neural network (STGNN) can be developed. Then, by employing a graph convolution network that exploits one- and multi-hop neighbor information, our STGNN component can encode rich spatial information from complex pairwise dependencies between variables. With a temporal module that consists of dilated convolutional functions, the STGNN can further capture long-range dependence over time. A novel anomaly scoring component is further integrated into CST-GL to estimate the degree of an anomaly in a purely unsupervised manner. Experimental results demonstrate that CST-GL can detect anomalies effectively in general settings as well as enable early detection across different time delays.
    PWISeg: Point-based Weakly-supervised Instance Segmentation for Surgical Instruments. (arXiv:2311.09819v1 [cs.CV])
    In surgical procedures, correct instrument counting is essential. Instance segmentation is a location method that locates not only an object's bounding box but also each pixel's specific details. However, obtaining mask-level annotations is labor-intensive in instance segmentation. To address this issue, we propose a novel yet effective weakly-supervised surgical instrument instance segmentation approach, named Point-based Weakly-supervised Instance Segmentation (PWISeg). PWISeg adopts an FCN-based architecture with point-to-box and point-to-mask branches to model the relationships between feature points and bounding boxes, as well as feature points and segmentation masks on FPN, accomplishing instrument detection and segmentation jointly in a single model. Since mask level annotations are hard to available in the real world, for point-to-mask training, we introduce an unsupervised projection loss, utilizing the projected relation between predicted masks and bboxes as supervision signal. On the other hand, we annotate a few pixels as the key pixel for each instrument. Based on this, we further propose a key pixel association loss and a key pixel distribution loss, driving the point-to-mask branch to generate more accurate segmentation predictions. To comprehensively evaluate this task, we unveil a novel surgical instrument dataset with manual annotations, setting up a benchmark for further research. Our comprehensive research trial validated the superior performance of our PWISeg. The results show that the accuracy of surgical instrument segmentation is improved, surpassing most methods of instance segmentation via weakly supervised bounding boxes. This improvement is consistently observed in our proposed dataset and when applied to the public HOSPI-Tools dataset.
    Shared Growth of Graph Neural Networks via Prompted Free-direction Knowledge Distillation. (arXiv:2307.00534v3 [cs.LG] UPDATED)
    Knowledge distillation (KD) has shown to be effective to boost the performance of graph neural networks (GNNs), where the typical objective is to distill knowledge from a deeper teacher GNN into a shallower student GNN. However, it is often quite challenging to train a satisfactory deeper GNN due to the well-known over-parametrized and over-smoothing issues, leading to invalid knowledge transfer in practical applications. In this paper, we propose the first Free-direction Knowledge Distillation framework via reinforcement learning for GNNs, called FreeKD, which is no longer required to provide a deeper well-optimized teacher GNN. Our core idea is to collaboratively learn two shallower GNNs to exchange knowledge between them. As we observe that one typical GNN model often exhibits better and worse performances at different nodes during training, we devise a dynamic and free-direction knowledge transfer strategy that involves two levels of actions: 1) node-level action determines the directions of knowledge transfer between the corresponding nodes of two networks; and then 2) structure-level action determines which of the local structures generated by the node-level actions to be propagated. Additionally, considering that different augmented graphs can potentially capture distinct perspectives of the graph data, we propose FreeKD-Prompt that learns undistorted and diverse augmentations based on prompt learning for exchanging varied knowledge. Furthermore, instead of confining knowledge exchange within two GNNs, we develop FreeKD++ to enable free-direction knowledge transfer among multiple GNNs. Extensive experiments on five benchmark datasets demonstrate our approaches outperform the base GNNs in a large margin. More surprisingly, our FreeKD has comparable or even better performance than traditional KD algorithms that distill knowledge from a deeper and stronger teacher GNN.
    Open Problem: Learning with Variational Objectives on Measures. (arXiv:2306.11928v2 [stat.ML] UPDATED)
    The theory of statistical learning has focused on variational objectives expressed on functions. In this note, we discuss motivations to write similar objectives on measures, in particular to discuss out-of-distribution generalization and weakly-supervised learning. It raises a natural question: can one cast usual statistical learning results to objectives expressed on measures? Does the resulting construction lead to new algorithms of practical interest?
    Neuroscience inspired scientific machine learning (Part-1): Variable spiking neuron for regression. (arXiv:2311.09267v1 [cs.NE])
    Redundant information transfer in a neural network can increase the complexity of the deep learning model, thus increasing its power consumption. We introduce in this paper a novel spiking neuron, termed Variable Spiking Neuron (VSN), which can reduce the redundant firing using lessons from biological neuron inspired Leaky Integrate and Fire Spiking Neurons (LIF-SN). The proposed VSN blends LIF-SN and artificial neurons. It garners the advantage of intermittent firing from the LIF-SN and utilizes the advantage of continuous activation from the artificial neuron. This property of the proposed VSN makes it suitable for regression tasks, which is a weak point for the vanilla spiking neurons, all while keeping the energy budget low. The proposed VSN is tested against both classification and regression tasks. The results produced advocate favorably towards the efficacy of the proposed spiking neuron, particularly for regression tasks.
    Adversarial Examples Exist in Two-Layer ReLU Networks for Low Dimensional Linear Subspaces. (arXiv:2303.00783v2 [cs.LG] UPDATED)
    Despite a great deal of research, it is still not well-understood why trained neural networks are highly vulnerable to adversarial examples. In this work we focus on two-layer neural networks trained using data which lie on a low dimensional linear subspace. We show that standard gradient methods lead to non-robust neural networks, namely, networks which have large gradients in directions orthogonal to the data subspace, and are susceptible to small adversarial $L_2$-perturbations in these directions. Moreover, we show that decreasing the initialization scale of the training algorithm, or adding $L_2$ regularization, can make the trained network more robust to adversarial perturbations orthogonal to the data.
    Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-start. (arXiv:2202.03397v4 [stat.ML] UPDATED)
    We analyse a general class of bilevel problems, in which the upper-level problem consists in the minimization of a smooth objective function and the lower-level problem is to find the fixed point of a smooth contraction map. This type of problems include instances of meta-learning, equilibrium models, hyperparameter optimization and data poisoning adversarial attacks. Several recent works have proposed algorithms which warm-start the lower-level problem, i.e.~they use the previous lower-level approximate solution as a staring point for the lower-level solver. This warm-start procedure allows one to improve the sample complexity in both the stochastic and deterministic settings, achieving in some cases the order-wise optimal sample complexity. However, there are situations, e.g., meta learning and equilibrium models, in which the warm-start procedure is not well-suited or ineffective. In this work we show that without warm-start, it is still possible to achieve order-wise (near) optimal sample complexity. In particular, we propose a simple method which uses (stochastic) fixed point iterations at the lower-level and projected inexact gradient descent at the upper-level, that reaches an $\epsilon$-stationary point using $O(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-1})$ samples for the stochastic and the deterministic setting, respectively. Finally, compared to methods using warm-start, our approach yields a simpler analysis that does not need to study the coupled interactions between the upper-level and lower-level iterates.
    What Constitutes a Faithful Summary? Preserving Author Perspectives in News Summarization. (arXiv:2311.09741v1 [cs.CL])
    In this work, we take a first step towards designing summarization systems that are faithful to the author's opinions and perspectives. Focusing on a case study of preserving political perspectives in news summarization, we find that existing approaches alter the political opinions and stances of news articles in more than 50% of summaries, misrepresenting the intent and perspectives of the news authors. We thus propose P^3Sum, a diffusion model-based summarization approach controlled by political perspective classifiers. In P^3Sum, the political leaning of a generated summary is iteratively evaluated at each decoding step, and any drift from the article's original stance incurs a loss back-propagated to the embedding layers, steering the political stance of the summary at inference time. Extensive experiments on three news summarization datasets demonstrate that P^3Sum outperforms state-of-the-art summarization systems and large language models by up to 11.4% in terms of the success rate of stance preservation, with on-par performance on standard summarization utility metrics. These findings highlight the lacunae that even for state-of-the-art models it is still challenging to preserve author perspectives in news summarization, while P^3Sum presents an important first step towards evaluating and developing summarization systems that are faithful to author intent and perspectives.
    Co-data Learning for Bayesian Additive Regression Trees. (arXiv:2311.09997v1 [stat.ML])
    Medical prediction applications often need to deal with small sample sizes compared to the number of covariates. Such data pose problems for prediction and variable selection, especially when the covariate-response relationship is complicated. To address these challenges, we propose to incorporate co-data, i.e. external information on the covariates, into Bayesian additive regression trees (BART), a sum-of-trees prediction model that utilizes priors on the tree parameters to prevent overfitting. To incorporate co-data, an empirical Bayes (EB) framework is developed that estimates, assisted by a co-data model, prior covariate weights in the BART model. The proposed method can handle multiple types of co-data simultaneously. Furthermore, the proposed EB framework enables the estimation of the other hyperparameters of BART as well, rendering an appealing alternative to cross-validation. We show that the method finds relevant covariates and that it improves prediction compared to default BART in simulations. If the covariate-response relationship is nonlinear, the method benefits from the flexibility of BART to outperform regression-based co-data learners. Finally, the use of co-data enhances prediction in an application to diffuse large B-cell lymphoma prognosis based on clinical covariates, gene mutations, DNA translocations, and DNA copy number data. Keywords: Bayesian additive regression trees; Empirical Bayes; Co-data; High-dimensional data; Omics; Prediction
    Privacy Threats in Stable Diffusion Models. (arXiv:2311.09355v1 [cs.CV])
    This paper introduces a novel approach to membership inference attacks (MIA) targeting stable diffusion computer vision models, specifically focusing on the highly sophisticated Stable Diffusion V2 by StabilityAI. MIAs aim to extract sensitive information about a model's training data, posing significant privacy concerns. Despite its advancements in image synthesis, our research reveals privacy vulnerabilities in the stable diffusion models' outputs. Exploiting this information, we devise a black-box MIA that only needs to query the victim model repeatedly. Our methodology involves observing the output of a stable diffusion model at different generative epochs and training a classification model to distinguish when a series of intermediates originated from a training sample or not. We propose numerous ways to measure the membership features and discuss what works best. The attack's efficacy is assessed using the ROC AUC method, demonstrating a 60\% success rate in inferring membership information. This paper contributes to the growing body of research on privacy and security in machine learning, highlighting the need for robust defenses against MIAs. Our findings prompt a reevaluation of the privacy implications of stable diffusion models, urging practitioners and developers to implement enhanced security measures to safeguard against such attacks.
    Conditional Matrix Flows for Gaussian Graphical Models. (arXiv:2306.07255v2 [cs.LG] UPDATED)
    Studying conditional independence among many variables with few observations is a challenging task. Gaussian Graphical Models (GGMs) tackle this problem by encouraging sparsity in the precision matrix through $l_q$ regularization with $q\leq1$. However, most GMMs rely on the $l_1$ norm because the objective is highly non-convex for sub-$l_1$ pseudo-norms. In the frequentist formulation, the $l_1$ norm relaxation provides the solution path as a function of the shrinkage parameter $\lambda$. In the Bayesian formulation, sparsity is instead encouraged through a Laplace prior, but posterior inference for different $\lambda$ requires repeated runs of expensive Gibbs samplers. Here we propose a general framework for variational inference with matrix-variate Normalizing Flow in GGMs, which unifies the benefits of frequentist and Bayesian frameworks. As a key improvement on previous work, we train with one flow a continuum of sparse regression models jointly for all regularization parameters $\lambda$ and all $l_q$ norms, including non-convex sub-$l_1$ pseudo-norms. Within one model we thus have access to (i) the evolution of the posterior for any $\lambda$ and any $l_q$ (pseudo-) norm, (ii) the marginal log-likelihood for model selection, and (iii) the frequentist solution paths through simulated annealing in the MAP limit.
    Accelerating material discovery with a threshold-driven hybrid acquisition policy-based Bayesian optimization. (arXiv:2311.09591v1 [cs.LG])
    Advancements in materials play a crucial role in technological progress. However, the process of discovering and developing materials with desired properties is often impeded by substantial experimental costs, extensive resource utilization, and lengthy development periods. To address these challenges, modern approaches often employ machine learning (ML) techniques such as Bayesian Optimization (BO), which streamline the search for optimal materials by iteratively selecting experiments that are most likely to yield beneficial results. However, traditional BO methods, while beneficial, often struggle with balancing the trade-off between exploration and exploitation, leading to sub-optimal performance in material discovery processes. This paper introduces a novel Threshold-Driven UCB-EI Bayesian Optimization (TDUE-BO) method, which dynamically integrates the strengths of Upper Confidence Bound (UCB) and Expected Improvement (EI) acquisition functions to optimize the material discovery process. Unlike the classical BO, our method focuses on efficiently navigating the high-dimensional material design space (MDS). TDUE-BO begins with an exploration-focused UCB approach, ensuring a comprehensive initial sweep of the MDS. As the model gains confidence, indicated by reduced uncertainty, it transitions to the more exploitative EI method, focusing on promising areas identified earlier. The UCB-to-EI switching policy dictated guided through continuous monitoring of the model uncertainty during each step of sequential sampling results in navigating through the MDS more efficiently while ensuring rapid convergence. The effectiveness of TDUE-BO is demonstrated through its application on three different material datasets, showing significantly better approximation and optimization performance over the EI and UCB-based BO methods in terms of the RMSE scores and convergence efficiency, respectively.
    Towards More Realistic Membership Inference Attacks on Large Diffusion Models. (arXiv:2306.12983v2 [cs.LG] UPDATED)
    Generative diffusion models, including Stable Diffusion and Midjourney, can generate visually appealing, diverse, and high-resolution images for various applications. These models are trained on billions of internet-sourced images, raising significant concerns about the potential unauthorized use of copyright-protected images. In this paper, we examine whether it is possible to determine if a specific image was used in the training set, a problem known in the cybersecurity community and referred to as a membership inference attack. Our focus is on Stable Diffusion, and we address the challenge of designing a fair evaluation framework to answer this membership question. We propose a methodology to establish a fair evaluation setup and apply it to Stable Diffusion, enabling potential extensions to other generative models. Utilizing this evaluation setup, we execute membership attacks (both known and newly introduced). Our research reveals that previously proposed evaluation setups do not provide a full understanding of the effectiveness of membership inference attacks. We conclude that the membership inference attack remains a significant challenge for large diffusion models (often deployed as black-box systems), indicating that related privacy and copyright issues will persist in the foreseeable future.
    H-Packer: Holographic Rotationally Equivariant Convolutional Neural Network for Protein Side-Chain Packing. (arXiv:2311.09312v1 [q-bio.BM])
    Accurately modeling protein 3D structure is essential for the design of functional proteins. An important sub-task of structure modeling is protein side-chain packing: predicting the conformation of side-chains (rotamers) given the protein's backbone structure and amino-acid sequence. Conventional approaches for this task rely on expensive sampling procedures over hand-crafted energy functions and rotamer libraries. Recently, several deep learning methods have been developed to tackle the problem in a data-driven way, albeit with vastly different formulations (from image-to-image translation to directly predicting atomic coordinates). Here, we frame the problem as a joint regression over the side-chains' true degrees of freedom: the dihedral $\chi$ angles. We carefully study possible objective functions for this task, while accounting for the underlying symmetries of the task. We propose Holographic Packer (H-Packer), a novel two-stage algorithm for side-chain packing built on top of two light-weight rotationally equivariant neural networks. We evaluate our method on CASP13 and CASP14 targets. H-Packer is computationally efficient and shows favorable performance against conventional physics-based algorithms and is competitive against alternative deep learning solutions.
    Striped Attention: Faster Ring Attention for Causal Transformers. (arXiv:2311.09431v1 [cs.LG])
    To help address the growing demand for ever-longer sequence lengths in transformer models, Liu et al. recently proposed Ring Attention, an exact attention algorithm capable of overcoming per-device memory bottle- necks by distributing self-attention across multiple devices. In this paper, we study the performance characteristics of Ring Attention in the important special case of causal transformer models, and identify a key workload imbal- ance due to triangular structure of causal attention computations. We propose a simple extension to Ring Attention, which we call Striped Attention to fix this imbalance. Instead of devices having contiguous subsequences, each device has a subset of tokens distributed uniformly throughout the sequence, which we demonstrate leads to more even workloads. In experiments running Striped Attention on A100 GPUs and TPUv4s, we are able to achieve up to 1.45x end-to-end throughput improvements over the original Ring Attention algorithm on causal transformer training at a sequence length of 256k. Furthermore, on 16 TPUv4 chips, we were able to achieve 1.65x speedups at sequence lengths of 786k. We release the code for our experiments as open source
    Augmenting Unsupervised Reinforcement Learning with Self-Reference. (arXiv:2311.09692v1 [cs.LG])
    Humans possess the ability to draw on past experiences explicitly when learning new tasks and applying them accordingly. We believe this capacity for self-referencing is especially advantageous for reinforcement learning agents in the unsupervised pretrain-then-finetune setting. During pretraining, an agent's past experiences can be explicitly utilized to mitigate the nonstationarity of intrinsic rewards. In the finetuning phase, referencing historical trajectories prevents the unlearning of valuable exploratory behaviors. Motivated by these benefits, we propose the Self-Reference (SR) approach, an add-on module explicitly designed to leverage historical information and enhance agent performance within the pretrain-finetune paradigm. Our approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark for model-free methods, recording an 86% IQM and a 16% Optimality Gap. Additionally, it improves current algorithms by up to 17% IQM and reduces the Optimality Gap by 31%. Beyond performance enhancement, the Self-Reference add-on also increases sample efficiency, a crucial attribute for real-world applications.
    Beyond Detection: Unveiling Fairness Vulnerabilities in Abusive Language Models. (arXiv:2311.09428v1 [cs.CL])
    This work investigates the potential of undermining both fairness and detection performance in abusive language detection. In a dynamic and complex digital world, it is crucial to investigate the vulnerabilities of these detection models to adversarial fairness attacks to improve their fairness robustness. We propose a simple yet effective framework FABLE that leverages backdoor attacks as they allow targeted control over the fairness and detection performance. FABLE explores three types of trigger designs (i.e., rare, artificial, and natural triggers) and novel sampling strategies. Specifically, the adversary can inject triggers into samples in the minority group with the favored outcome (i.e., ``non-abusive'') and flip their labels to the unfavored outcome, i.e., ``abusive''. Experiments on benchmark datasets demonstrate the effectiveness of FABLE attacking fairness and utility in abusive language detection.
    Data Consistent Deep Rigid MRI Motion Correction. (arXiv:2301.10365v2 [eess.IV] UPDATED)
    Motion artifacts are a pervasive problem in MRI, leading to misdiagnosis or mischaracterization in population-level imaging studies. Current retrospective rigid intra-slice motion correction techniques jointly optimize estimates of the image and the motion parameters. In this paper, we use a deep network to reduce the joint image-motion parameter search to a search over rigid motion parameters alone. Our network produces a reconstruction as a function of two inputs: corrupted k-space data and motion parameters. We train the network using simulated, motion-corrupted k-space data generated with known motion parameters. At test-time, we estimate unknown motion parameters by minimizing a data consistency loss between the motion parameters, the network-based image reconstruction given those parameters, and the acquired measurements. Intra-slice motion correction experiments on simulated and realistic 2D fast spin echo brain MRI achieve high reconstruction fidelity while providing the benefits of explicit data consistency optimization. Our code is publicly available at https://www.github.com/nalinimsingh/neuroMoCo.
    Biomembrane-based Memcapacitive Reservoir Computing System for Energy Efficient Temporal Data Processing. (arXiv:2305.12025v2 [cs.LG] UPDATED)
    Reservoir computing is a highly efficient machine learning framework for processing temporal data by extracting features from the input signal and mapping them into higher dimensional spaces. Physical reservoir layers have been realized using spintronic oscillators, atomic switch networks, silicon photonic modules, ferroelectric transistors, and volatile memristors. However, these devices are intrinsically energy-dissipative due to their resistive nature, which leads to increased power consumption. Therefore, capacitive memory devices can provide a more energy-efficient approach. Here, we leverage volatile biomembrane-based memcapacitors that closely mimic certain short-term synaptic plasticity functions as reservoirs to solve classification tasks and analyze time-series data in simulation and experimentally. Our system achieves a 99.6% accuracy rate for spoken digit classification and a normalized mean square error of 7.81*10^{-4} in a second-order non-linear regression task. Furthermore, to showcase the device's real-time temporal data processing capability, we achieve 100% accuracy for a real-time epilepsy detection problem from an inputted electroencephalography (EEG) signal. Most importantly, we demonstrate that each memcapacitor consumes an average of 41.5 fJ of energy per spike, regardless of the selected input voltage pulse width, while maintaining an average power of 415 fW for a pulse width of 100 ms. These values are orders of magnitude lower than those achieved by state-of-the-art memristors used as reservoirs. Lastly, we believe the biocompatible, soft nature of our memcapacitor makes it highly suitable for computing and signal-processing applications in biological environments.
    Beyond PCA: A Probabilistic Gram-Schmidt Approach to Feature Extraction. (arXiv:2311.09386v1 [cs.LG])
    Linear feature extraction at the presence of nonlinear dependencies among the data is a fundamental challenge in unsupervised learning. We propose using a Probabilistic Gram-Schmidt (PGS) type orthogonalization process in order to detect and map out redundant dimensions. Specifically, by applying the PGS process over any family of functions which presumably captures the nonlinear dependencies in the data, we construct a series of covariance matrices that can either be used to remove those dependencies from the principal components, or to identify new large-variance directions. In the former case, we prove that under certain assumptions the resulting algorithms detect and remove nonlinear dependencies whenever those dependencies lie in the linear span of the chosen function family. In the latter, we provide information-theoretic guarantees in terms of entropy reduction. Both proposed methods extract linear features from the data while removing nonlinear redundancies. We provide simulation results on synthetic and real-world datasets which show improved performance over PCA and state-of-the-art linear feature extraction algorithms, both in terms of variance maximization of the extracted features, and in terms of improved performance of classification algorithms.
    Generative AI-Based Probabilistic Constellation Shaping With Diffusion Models. (arXiv:2311.09349v1 [cs.IT])
    Diffusion models are at the vanguard of generative AI research with renowned solutions such as ImageGen by Google Brain and DALL.E 3 by OpenAI. Nevertheless, the potential merits of diffusion models for communication engineering applications are not fully understood yet. In this paper, we aim to unleash the power of generative AI for PHY design of constellation symbols in communication systems. Although the geometry of constellations is predetermined according to networking standards, e.g., quadrature amplitude modulation (QAM), probabilistic shaping can design the probability of occurrence (generation) of constellation symbols. This can help improve the information rate and decoding performance of communication systems. We exploit the ``denoise-and-generate'' characteristics of denoising diffusion probabilistic models (DDPM) for probabilistic constellation shaping. The key idea is to learn generating constellation symbols out of noise, ``mimicking'' the way the receiver performs symbol reconstruction. This way, we make the constellation symbols sent by the transmitter, and what is inferred (reconstructed) at the receiver become as similar as possible, resulting in as few mismatches as possible. Our results show that the generative AI-based scheme outperforms deep neural network (DNN)-based benchmark and uniform shaping, while providing network resilience as well as robust out-of-distribution performance under low-SNR regimes and non-Gaussian assumptions. Numerical evaluations highlight 30% improvement in terms of cosine similarity and a threefold improvement in terms of mutual information compared to DNN-based approach for 64-QAM geometry.
    Time-dependent Probabilistic Generative Models for Disease Progression. (arXiv:2311.09369v1 [stat.ML])
    Electronic health records contain valuable information for monitoring patients' health trajectories over time. Disease progression models have been developed to understand the underlying patterns and dynamics of diseases using these data as sequences. However, analyzing temporal data from EHRs is challenging due to the variability and irregularities present in medical records. We propose a Markovian generative model of treatments developed to (i) model the irregular time intervals between medical events; (ii) classify treatments into subtypes based on the patient sequence of medical events and the time intervals between them; and (iii) segment treatments into subsequences of disease progression patterns. We assume that sequences have an associated structure of latent variables: a latent class representing the different subtypes of treatments; and a set of latent stages indicating the phase of progression of the treatments. We use the Expectation-Maximization algorithm to learn the model, which is efficiently solved with a dynamic programming-based method. Various parametric models have been employed to model the time intervals between medical events during the learning process, including the geometric, exponential, and Weibull distributions. The results demonstrate the effectiveness of our model in recovering the underlying model from data and accurately modeling the irregular time intervals between medical actions.  ( 2 min )
    CRISPR: Eliminating Bias Neurons from an Instruction-following Language Model. (arXiv:2311.09627v1 [cs.AI])
    Large language models (LLMs) executing tasks through instruction-based prompts often face challenges stemming from distribution differences between user instructions and training instructions. This leads to distractions and biases, especially when dealing with inconsistent dynamic labels. In this paper, we introduces a novel bias mitigation method, CRISPR, designed to alleviate instruction-label biases in LLMs. CRISPR utilizes attribution methods to identify bias neurons influencing biased outputs and employs pruning to eliminate the bias neurons. Experimental results demonstrate the method's effectiveness in mitigating biases in instruction-based prompting, enhancing language model performance on social bias benchmarks without compromising pre-existing knowledge. CRISPR proves highly practical, model-agnostic, offering flexibility in adapting to evolving social biases.
    Fossil 2.0: Formal Certificate Synthesis for the Verification and Control of Dynamical Models. (arXiv:2311.09793v1 [eess.SY])
    This paper presents Fossil 2.0, a new major release of a software tool for the synthesis of certificates (e.g., Lyapunov and barrier functions) for dynamical systems modelled as ordinary differential and difference equations. Fossil 2.0 is much improved from its original release, including new interfaces, a significantly expanded certificate portfolio, controller synthesis and enhanced extensibility. We present these new features as part of this tool paper. Fossil implements a counterexample-guided inductive synthesis (CEGIS) loop ensuring the soundness of the method. Our tool uses neural networks as templates to generate candidate functions, which are then formally proven by an SMT solver acting as an assertion verifier. Improvements with respect to the first release include a wider range of certificates, synthesis of control laws, and support for discrete-time models.  ( 2 min )
    FOCUS: Fairness via Agent-Awareness for Federated Learning on Heterogeneous Data. (arXiv:2207.10265v4 [cs.LG] UPDATED)
    Federated learning (FL) allows agents to jointly train a global model without sharing their local data. However, due to the heterogeneous nature of local data, it is challenging to optimize or even define fairness of the trained global model for the agents. For instance, existing work usually considers accuracy equity as fairness for different agents in FL, which is limited, especially under the heterogeneous setting, since it is intuitively "unfair" to enforce agents with high-quality data to achieve similar accuracy to those who contribute low-quality data, which may discourage the agents from participating in FL. In this work, we propose a formal FL fairness definition, fairness via agent-awareness (FAA), which takes different contributions of heterogeneous agents into account. Under FAA, the performance of agents with high-quality data will not be sacrificed just due to the existence of large amounts of agents with low-quality data. In addition, we propose a fair FL training algorithm based on agent clustering (FOCUS) to achieve fairness in FL measured by FAA. Theoretically, we prove the convergence and optimality of FOCUS under mild conditions for linear and general convex loss functions with bounded smoothness. We also prove that FOCUS always achieves higher fairness in terms of FAA compared with standard FedAvg under both linear and general convex loss functions. Empirically, we show that on four FL datasets, including synthetic data, images, and texts, FOCUS achieves significantly higher fairness in terms of FAA while maintaining competitive prediction accuracy compared with FedAvg and state-of-the-art fair FL algorithms.
    Performance Trade-offs of Watermarking Large Language Models. (arXiv:2311.09816v1 [cs.CL])
    Amidst growing concerns of large language models (LLMs) being misused for generating misinformation or completing homework assignments, watermarking has emerged as an effective solution for distinguishing human-written and LLM-generated text. A prominent watermarking strategy is to embed a signal into generated text by upsampling a (pseudorandomly-chosen) subset of tokens at every generation step. Although this signal is imperceptible to a human reader, it is detectable through statistical testing. However, implanting such signals alters the model's output distribution and can have unintended effects when watermarked LLMs are used for downstream applications. In this work, we evaluate the performance of watermarked LLMs on a diverse suite of tasks, including text classification, textual entailment, reasoning, question answering, translation, summarization, and language modeling. We find that watermarking has negligible impact on the performance of tasks posed as k-class classification problems in the average case. However, the accuracy can plummet to that of a random classifier for some scenarios (that occur with non-negligible probability). Tasks that are cast as multiple-choice questions and short-form generation are surprisingly unaffected by watermarking. For long-form generation tasks, including summarization and translation, we see a drop of 15-20% in the performance due to watermarking. Our findings highlight the trade-offs that users should be cognizant of when using watermarked models, and point to cases where future research could improve existing trade-offs.
    Xputer: Bridging Data Gaps with NMF, XGBoost, and a Streamlined GUI Experience. (arXiv:2311.09989v1 [cs.LG])
    The rapid proliferation of data across diverse fields has accentuated the importance of accurate imputation for missing values. This task is crucial for ensuring data integrity and deriving meaningful insights. In response to this challenge, we present Xputer, a novel imputation tool that adeptly integrates Non-negative Matrix Factorization (NMF) with the predictive strengths of XGBoost. One of Xputer's standout features is its versatility: it supports zero imputation, enables hyperparameter optimization through Optuna, and allows users to define the number of iterations. For enhanced user experience and accessibility, we have equipped Xputer with an intuitive Graphical User Interface (GUI) ensuring ease of handling, even for those less familiar with computational tools. In performance benchmarks, Xputer not only rivals the computational speed of established tools such as IterativeImputer but also often outperforms them in terms of imputation accuracy. Furthermore, Xputer autonomously handles a diverse spectrum of data types, including categorical, continuous, and Boolean, eliminating the need for prior preprocessing. Given its blend of performance, flexibility, and user-friendly design, Xputer emerges as a state-of-the-art solution in the realm of data imputation.
    GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection. (arXiv:2306.12251v2 [cs.LG] UPDATED)
    With a long history of traditional Graph Anomaly Detection (GAD) algorithms and recently popular Graph Neural Networks (GNNs), it is still not clear (1) how they perform under a standard comprehensive setting, (2) whether GNNs can outperform traditional algorithms such as tree ensembles, and (3) how about their efficiency on large-scale graphs. In response, we introduce GADBench -- a benchmark tool dedicated to supervised anomalous node detection in static graphs. GADBench facilitates a detailed comparison across 29 distinct models on ten real-world GAD datasets, encompassing thousands to millions ($\sim$6M) nodes. Our main finding is that tree ensembles with simple neighborhood aggregation can outperform the latest GNNs tailored for the GAD task. We shed light on the current progress of GAD, setting a robust groundwork for subsequent investigations in this domain. GADBench is open-sourced at https://github.com/squareRoot3/GADBench.
    A Speed Odyssey for Deployable Quantization of LLMs. (arXiv:2311.09550v1 [cs.LG])
    The large language model era urges faster and less costly inference. Prior model compression works on LLMs tend to undertake a software-centric approach primarily focused on the simulated quantization performance. By neglecting the feasibility of deployment, these approaches are typically disabled in real practice. They used to drastically push down the quantization bit range for a reduced computation which might not be supported by the mainstream hardware, or involve sophisticated algorithms that introduce extra computation or memory access overhead. We argue that pursuing a hardware-centric approach in the construction of quantization algorithms is crucial. In this regard, we are driven to build our compression method on top of hardware awareness, eliminating impractical algorithm choices while maximizing the benefit of hardware acceleration. Our method, OdysseyLLM, comes with a novel W4A8 kernel implementation called FastGEMM and a combined recipe of quantization strategies. Extensive experiments manifest the superiority of our W4A8 method which brings the actual speed boosting up to \textbf{4$\times$} compared to Hugging Face FP16 inference and \textbf{2.23$\times$} vs. the state-of-the-art inference engine TensorRT-LLM in FP16, and \textbf{1.45$\times$} vs. TensorRT-LLM in INT8, yet without substantially harming the performance.
    GroupMixer: Patch-based Group Convolutional Neural Network for Breast Cancer Detection from Histopathological Images. (arXiv:2311.09846v1 [eess.IV])
    Diagnosis of breast cancer malignancy at the early stages is a crucial step for controlling its side effects. Histopathological analysis provides a unique opportunity for malignant breast cancer detection. However, such a task would be tedious and time-consuming for the histopathologists. Deep Neural Networks enable us to learn informative features directly from raw histopathological images without manual feature extraction. Although Convolutional Neural Networks (CNNs) have been the dominant architectures in the computer vision realm, Transformer-based architectures have shown promising results in different computer vision tasks. Although harnessing the capability of Transformer-based architectures for medical image analysis seems interesting, these architectures are large, have a significant number of trainable parameters, and require large datasets to be trained on, which are usually rare in the medical domain. It has been claimed and empirically proved that at least part of the superior performance of Transformer-based architectures in Computer Vision domain originates from patch embedding operation. In this paper, we borrowed the previously introduced idea of integrating a fully Convolutional Neural Network architecture with Patch Embedding operation and presented an efficient CNN architecture for breast cancer malignancy detection from histopathological images. Despite the number of parameters that is significantly smaller than other methods, the accuracy performance metrics achieved 97.65%, 98.92%, 99.21%, and 98.01% for 40x, 100x, 200x, and 400x magnifications respectively. We took a step forward and modified the architecture using Group Convolution and Channel Shuffling ideas and reduced the number of trainable parameters even more with a negligible decline in performance and achieved 95.42%, 98.16%, 96.05%, and 97.92% accuracy for the mentioned magnifications respectively.
    Identifying Systems with Symmetries using Equivariant Autoregressive Reservoir Computers. (arXiv:2311.09511v1 [eess.SY])
    The investigation reported in this document focuses on identifying systems with symmetries using equivariant autoregressive reservoir computers. General results in structured matrix approximation theory are presented, exploring a two-fold approach. Firstly, a comprehensive examination of generic symmetry-preserving nonlinear time delay embedding is conducted. This involves analyzing time series data sampled from an equivariant system under study. Secondly, sparse least-squares methods are applied to discern approximate representations of the output coupling matrices. These matrices play a pivotal role in determining the nonlinear autoregressive representation of an equivariant system. The structural characteristics of these matrices are dictated by the set of symmetries inherent in the system. The document outlines prototypical algorithms derived from the described techniques, offering insight into their practical applications. Emphasis is placed on their effectiveness in the identification and predictive simulation of equivariant nonlinear systems, regardless of whether such systems exhibit chaotic behavior.
    Know Thy Neighbors: A Graph Based Approach for Effective Sensor-Based Human Activity Recognition in Smart Homes. (arXiv:2311.09514v1 [cs.LG])
    There has been a resurgence of applications focused on Human Activity Recognition (HAR) in smart homes, especially in the field of ambient intelligence and assisted living technologies. However, such applications present numerous significant challenges to any automated analysis system operating in the real world, such as variability, sparsity, and noise in sensor measurements. Although state-of-the-art HAR systems have made considerable strides in addressing some of these challenges, they especially suffer from a practical limitation: they require successful pre-segmentation of continuous sensor data streams before automated recognition, i.e., they assume that an oracle is present during deployment, which is capable of identifying time windows of interest across discrete sensor events. To overcome this limitation, we propose a novel graph-guided neural network approach that performs activity recognition by learning explicit co-firing relationships between sensors. We accomplish this by learning a more expressive graph structure representing the sensor network in a smart home, in a data-driven manner. Our approach maps discrete input sensor measurements to a feature space through the application of attention mechanisms and hierarchical pooling of node embeddings. We demonstrate the effectiveness of our proposed approach by conducting several experiments on CASAS datasets, showing that the resulting graph-guided neural network outperforms the state-of-the-art method for HAR in smart homes across multiple datasets and by large margins. These results are promising because they push HAR for smart homes closer to real-world applications.  ( 3 min )
    Scaling User Modeling: Large-scale Online User Representations for Ads Personalization in Meta. (arXiv:2311.09544v1 [cs.IR])
    Effective user representations are pivotal in personalized advertising. However, stringent constraints on training throughput, serving latency, and memory, often limit the complexity and input feature set of online ads ranking models. This challenge is magnified in extensive systems like Meta's, which encompass hundreds of models with diverse specifications, rendering the tailoring of user representation learning for each model impractical. To address these challenges, we present Scaling User Modeling (SUM), a framework widely deployed in Meta's ads ranking system, designed to facilitate efficient and scalable sharing of online user representation across hundreds of ads models. SUM leverages a few designated upstream user models to synthesize user embeddings from massive amounts of user features with advanced modeling techniques. These embeddings then serve as inputs to downstream online ads ranking models, promoting efficient representation sharing. To adapt to the dynamic nature of user features and ensure embedding freshness, we designed SUM Online Asynchronous Platform (SOAP), a latency free online serving system complemented with model freshness and embedding stabilization, which enables frequent user model updates and online inference of user embeddings upon each user request. We share our hands-on deployment experiences for the SUM framework and validate its superiority through comprehensive experiments. To date, SUM has been launched to hundreds of ads ranking models in Meta, processing hundreds of billions of user requests daily, yielding significant online metric gains and infrastructure cost savings.
    Robust Contrastive Learning With Theory Guarantee. (arXiv:2311.09671v1 [cs.LG])
    Contrastive learning (CL) is a self-supervised training paradigm that allows us to extract meaningful features without any label information. A typical CL framework is divided into two phases, where it first tries to learn the features from unlabelled data, and then uses those features to train a linear classifier with the labeled data. While a fair amount of existing theoretical works have analyzed how the unsupervised loss in the first phase can support the supervised loss in the second phase, none has examined the connection between the unsupervised loss and the robust supervised loss, which can shed light on how to construct an effective unsupervised loss for the first phase of CL. To fill this gap, our work develops rigorous theories to dissect and identify which components in the unsupervised loss can help improve the robust supervised loss and conduct proper experiments to verify our findings.
    Alternatives to the Scaled Dot Product for Attention in the Transformer Neural Network Architecture. (arXiv:2311.09406v1 [cs.LG])
    The transformer neural network architecture uses a form of attention in which the dot product of query and key is divided by the square root of the key dimension before applying softmax. This scaling of the dot product is designed to avoid the absolute value of the dot products becoming so large that applying softmax leads to vanishing gradients. In this paper, we propose some alternative scalings, including dividing the dot product instead by the sum of the key lengths before applying softmax. We use simulated keys and queries to show that in many situations this appears to be more effective at avoiding regions where applying softmax leads to vanishing gradients.
    Contribution Evaluation in Federated Learning: Examining Current Approaches. (arXiv:2311.09856v1 [cs.LG])
    Federated Learning (FL) has seen increasing interest in cases where entities want to collaboratively train models while maintaining privacy and governance over their data. In FL, clients with private and potentially heterogeneous data and compute resources come together to train a common model without raw data ever leaving their locale. Instead, the participants contribute by sharing local model updates, which, naturally, differ in quality. Quantitatively evaluating the worth of these contributions is termed the Contribution Evaluation (CE) problem. We review current CE approaches from the underlying mathematical framework to efficiently calculate a fair value for each client. Furthermore, we benchmark some of the most promising state-of-the-art approaches, along with a new one we introduce, on MNIST and CIFAR-10, to showcase their differences. Designing a fair and efficient CE method, while a small part of the overall FL system design, is tantamount to the mainstream adoption of FL.
    Improved Sparse Ising Optimization. (arXiv:2311.09275v1 [cs.LG])
    Sparse Ising problems can be found in application areas such as logistics, condensed matter physics and training of deep Boltzmann networks, but can be very difficult to tackle with high efficiency and accuracy. This report presents new data demonstrating significantly higher performance on some longstanding benchmark problems with up to 20,000 variables. The data come from a new heuristic algorithm tested on the large sparse instances from the Gset benchmark suite. Relative to leading reported combinations of speed and accuracy (e.g., from Toshiba's Simulated Bifurcation Machine and Breakout Local Search), a proof-of-concept implementation reached targets 2-4 orders of magnitude faster. For two instances (G72 and G77) the new algorithm discovered a better solution than all previously reported values. Solution bitstrings confirming these two best solutions are provided. The data suggest exciting possibilities for pushing the sparse Ising performance frontier to potentially strengthen algorithm portfolios, AI toolkits and decision-making systems.
    Soft Matching Distance: A metric on neural representations that captures single-neuron tuning. (arXiv:2311.09466v1 [cs.LG])
    Common measures of neural representational (dis)similarity are designed to be insensitive to rotations and reflections of the neural activation space. Motivated by the premise that the tuning of individual units may be important, there has been recent interest in developing stricter notions of representational (dis)similarity that require neurons to be individually matched across networks. When two networks have the same size (i.e. same number of neurons), a distance metric can be formulated by optimizing over neuron index permutations to maximize tuning curve alignment. However, it is not clear how to generalize this metric to measure distances between networks with different sizes. Here, we leverage a connection to optimal transport theory to derive a natural generalization based on "soft" permutations. The resulting metric is symmetric, satisfies the triangle inequality, and can be interpreted as a Wasserstein distance between two empirical distributions. Further, our proposed metric avoids counter-intuitive outcomes suffered by alternative approaches, and captures complementary geometric insights into neural representations that are entirely missed by rotation-invariant metrics.
    Polynomially Over-Parameterized Convolutional Neural Networks Contain Structured Strong Winning Lottery Tickets. (arXiv:2311.09858v1 [cs.LG])
    The Strong Lottery Ticket Hypothesis (SLTH) states that randomly-initialised neural networks likely contain subnetworks that perform well without any training. Although unstructured pruning has been extensively studied in this context, its structured counterpart, which can deliver significant computational and memory efficiency gains, has been largely unexplored. One of the main reasons for this gap is the limitations of the underlying mathematical tools used in formal analyses of the SLTH. In this paper, we overcome these limitations: we leverage recent advances in the multidimensional generalisation of the Random Subset-Sum Problem and obtain a variant that admits the stochastic dependencies that arise when addressing structured pruning in the SLTH. We apply this result to prove, for a wide class of random Convolutional Neural Networks, the existence of structured subnetworks that can approximate any sufficiently smaller network. This result provides the first sub-exponential bound around the SLTH for structured pruning, opening up new avenues for further research on the hypothesis and contributing to the understanding of the role of over-parameterization in deep learning.
    Correlation-aware active learning for surgery video segmentation. (arXiv:2311.08811v1 [cs.CV] CROSS LISTED)
    Semantic segmentation is a complex task that relies heavily on large amounts of annotated image data. However, annotating such data can be time-consuming and resource-intensive, especially in the medical domain. Active Learning (AL) is a popular approach that can help to reduce this burden by iteratively selecting images for annotation to improve the model performance. In the case of video data, it is important to consider the model uncertainty and the temporal nature of the sequences when selecting images for annotation. This work proposes a novel AL strategy for surgery video segmentation, \COALSamp{}, COrrelation-aWare Active Learning. Our approach involves projecting images into a latent space that has been fine-tuned using contrastive learning and then selecting a fixed number of representative images from local clusters of video frames. We demonstrate the effectiveness of this approach on two video datasets of surgical instruments and three real-world video datasets. The datasets and code will be made publicly available upon receiving necessary approvals.
    Rates of Convergence in Certain Native Spaces of Approximations used in Reinforcement Learning. (arXiv:2309.07383v3 [eess.SY] UPDATED)
    This paper studies convergence rates for some value function approximations that arise in a collection of reproducing kernel Hilbert spaces (RKHS) $H(\Omega)$. By casting an optimal control problem in a specific class of native spaces, strong rates of convergence are derived for the operator equation that enables offline approximations that appear in policy iteration. Explicit upper bounds on error in value function and controller approximations are derived in terms of power function $\Pwr_{H,N}$ for the space of finite dimensional approximants $H_N$ in the native space $H(\Omega)$. These bounds are geometric in nature and refine some well-known, now classical results concerning convergence of approximations of value functions.  ( 2 min )
    A Risk-Sensitive Approach to Policy Optimization. (arXiv:2208.09106v2 [cs.LG] UPDATED)
    Standard deep reinforcement learning (DRL) aims to maximize expected reward, considering collected experiences equally in formulating a policy. This differs from human decision-making, where gains and losses are valued differently and outlying outcomes are given increased consideration. It also fails to capitalize on opportunities to improve safety and/or performance through the incorporation of distributional context. Several approaches to distributional DRL have been investigated, with one popular strategy being to evaluate the projected distribution of returns for possible actions. We propose a more direct approach whereby risk-sensitive objectives, specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards, are optimized. This approach allows for outcomes to be weighed based on relative quality, can be used for both continuous and discrete action spaces, and may naturally be applied in both constrained and unconstrained settings. We show how to compute an asymptotically consistent estimate of the policy gradient for a broad class of risk-sensitive objectives via sampling, subsequently incorporating variance reduction and regularization measures to facilitate effective on-policy learning. We then demonstrate that the use of moderately "pessimistic" risk profiles, which emphasize scenarios where the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies. We test the approach using different risk profiles in six OpenAI Safety Gym environments, comparing to state of the art on-policy methods. Without cost constraints, we find that pessimistic risk profiles can be used to reduce cost while improving total reward accumulation. With cost constraints, they are seen to provide higher positive rewards than risk-neutral approaches at the prescribed allowable cost.
    Investigating the Corruption Robustness of Image Classifiers with Random Lp-norm Corruptions. (arXiv:2305.05400v2 [cs.LG] UPDATED)
    Robustness is a fundamental property of machine learning classifiers to achieve safety and reliability. In the fields of adversarial robustness and formal robustness verification of image classification models, robustness is commonly defined as the stability to all input variations within an Lp-norm distance. However, robustness to random corruptions is usually improved and evaluated using variations observed in the real-world, while mathematically defined Lp-norm corruptions are rarely considered. This study investigates the use of random Lp-norm corruptions to augment the training and test data of image classifiers. We adapt an approach from the field of adversarial robustness to assess the model robustness to imperceptible random corruptions. We empirically and theoretically investigate whether robustness is transferable across different Lp-norms and derive conclusions on which Lp-norm corruptions a model should be trained and evaluated on. We find that training data augmentation with L0-norm corruptions improves corruption robustness while maintaining accuracy compared to standard training and when applied on top of selected state-of-the-art data augmentation techniques.
    Banach-Tarski Embeddings and Transformers. (arXiv:2311.09387v1 [cs.LG])
    We introduce a new construction of embeddings of arbitrary recursive data structures into high dimensional vectors. These embeddings provide an interpretable model for the latent state vectors of transformers. We demonstrate that these embeddings can be decoded to the original data structure when the embedding dimension is sufficiently large. This decoding algorithm has a natural implementation as a transformer. We also show that these embedding vectors can be manipulated directly to perform computations on the underlying data without decoding. As an example we present an algorithm that constructs the embedded parse tree of an embedded token sequence using only vector operations in embedding space.  ( 2 min )
    Leveraging Citizen Science for Flood Extent Detection using Machine Learning Benchmark Dataset. (arXiv:2311.09276v1 [cs.CV])
    Accurate detection of inundated water extents during flooding events is crucial in emergency response decisions and aids in recovery efforts. Satellite Remote Sensing data provides a global framework for detecting flooding extents. Specifically, Sentinel-1 C-Band Synthetic Aperture Radar (SAR) imagery has proven to be useful in detecting water bodies due to low backscatter of water features in both co-polarized and cross-polarized SAR imagery. However, increased backscatter can be observed in certain flooded regions such as presence of infrastructure and trees - rendering simple methods such as pixel intensity thresholding and time-series differencing inadequate. Machine Learning techniques has been leveraged to precisely capture flood extents in flooded areas with bumps in backscatter but needs high amounts of labelled data to work desirably. Hence, we created a labeled known water body extent and flooded area extents during known flooding events covering about 36,000 sq. kilometers of regions within mainland U.S and Bangladesh. Further, We also leveraged citizen science by open-sourcing the dataset and hosting an open competition based on the dataset to rapidly prototype flood extent detection using community generated models. In this paper we present the information about the dataset, the data processing pipeline, a baseline model and the details about the competition, along with discussion on winning approaches. We believe the dataset adds to already existing datasets based on Sentinel-1C SAR data and leads to more robust modeling of flood extents. We also hope the results from the competition pushes the research in flood extent detection further.
    Divergences between Language Models and Human Brains. (arXiv:2311.09308v1 [cs.CL])
    Do machines and humans process language in similar ways? A recent line of research has hinted in the affirmative, demonstrating that human brain signals can be effectively predicted using the internal representations of language models (LMs). This is thought to reflect shared computational principles between LMs and human language processing. However, there are also clear differences in how LMs and humans acquire and use language, even if the final task they are performing is the same. Despite this, there is little work exploring systematic differences between human and machine language processing using brain data. To address this question, we examine the differences between LM representations and the human brain's responses to language, specifically by examining a dataset of Magnetoencephalography (MEG) responses to a written narrative. In doing so we identify three phenomena that, in prior work, LMs have been found to not capture well: emotional understanding, figurative language processing, and physical commonsense. By fine-tuning LMs on datasets related to these phenomena, we observe that fine-tuned LMs show improved alignment with human brain responses across these tasks. Our study implies that the observed divergences between LMs and human brains may stem from LMs' inadequate representation of these specific types of knowledge.  ( 2 min )
    Data-driven Preference Learning Methods for Sorting Problems with Multiple Temporal Criteria. (arXiv:2309.12620v2 [cs.LG] UPDATED)
    The advent of predictive methodologies has catalyzed the emergence of data-driven decision support across various domains. However, developing models capable of effectively handling input time series data presents an enduring challenge. This study presents novel preference learning approaches to multiple criteria sorting problems in the presence of temporal criteria. We first formulate a convex quadratic programming model characterized by fixed time discount factors, operating within a regularization framework. To enhance scalability and accommodate learnable time discount factors, we introduce a novel monotonic Recurrent Neural Network (mRNN). It is designed to capture the evolving dynamics of preferences over time while upholding critical properties inherent to MCS problems, including criteria monotonicity, preference independence, and the natural ordering of classes. The proposed mRNN can describe the preference dynamics by depicting marginal value functions and personalized time discount factors along with time, effectively amalgamating the interpretability of traditional MCS methods with the predictive potential offered by deep preference learning models. Comprehensive assessments of the proposed models are conducted, encompassing synthetic data scenarios and a real-case study centered on classifying valuable users within a mobile gaming app based on their historical in-app behavioral sequences. Empirical findings underscore the notable performance improvements achieved by the proposed models when compared to a spectrum of baseline methods, spanning machine learning, deep learning, and conventional multiple criteria sorting approaches.
    SurvTimeSurvival: Survival Analysis On The Patient With Multiple Visits/Records. (arXiv:2311.09854v1 [cs.LG])
    The accurate prediction of survival times for patients with severe diseases remains a critical challenge despite recent advances in artificial intelligence. This study introduces "SurvTimeSurvival: Survival Analysis On Patients With Multiple Visits/Records", utilizing the Transformer model to not only handle the complexities of time-varying covariates but also covariates data. We also tackle the data sparsity issue common to survival analysis datasets by integrating synthetic data generation into the learning process of our model. We show that our method outperforms state-of-the-art deep learning approaches on both covariates and time-varying covariates datasets. Our approach aims not only to enhance the understanding of individual patient survival trajectories across various medical conditions, thereby improving prediction accuracy, but also to play a pivotal role in designing clinical trials and creating new treatments.
    ICXML: An In-Context Learning Framework for Zero-Shot Extreme Multi-Label Classification. (arXiv:2311.09649v1 [cs.LG])
    This paper focuses on the task of Extreme Multi-Label Classification (XMC) whose goal is to predict multiple labels for each instance from an extremely large label space. While existing research has primarily focused on fully supervised XMC, real-world scenarios often lack complete supervision signals, highlighting the importance of zero-shot settings. Given the large label space, utilizing in-context learning approaches is not trivial. We address this issue by introducing In-Context Extreme Multilabel Learning (ICXML), a two-stage framework that cuts down the search space by generating a set of candidate labels through incontext learning and then reranks them. Extensive experiments suggest that ICXML advances the state of the art on two diverse public benchmarks.  ( 2 min )
    Network Wide Evacuation Traffic Prediction in a Rapidly Intensifying Hurricane from Traffic Detectors and Facebook Movement Data: A Deep Learning Approach. (arXiv:2311.09498v1 [cs.LG])
    Traffic prediction during hurricane evacuation is essential for optimizing the use of transportation infrastructures. It can reduce evacuation time by providing information on future congestion in advance. However, evacuation traffic prediction can be challenging as evacuation traffic patterns is significantly different than regular period traffic. A data-driven traffic prediction model is developed in this study by utilizing traffic detector and Facebook movement data during Hurricane Ian, a rapidly intensifying hurricane. We select 766 traffic detectors from Florida's 4 major interstates to collect traffic features. Additionally, we use Facebook movement data collected during Hurricane Ian's evacuation period. The deep-learning model is first trained on regular period (May-August 2022) data to understand regular traffic patterns and then Hurricane Ian's evacuation period data is used as test data. The model achieves 95% accuracy (RMSE = 356) during regular period, but it underperforms with 55% accuracy (RMSE = 1084) during the evacuation period. Then, a transfer learning approach is adopted where a pretrained model is used with additional evacuation related features to predict evacuation period traffic. After transfer learning, the model achieves 89% accuracy (RMSE = 514). Adding Facebook movement data further reduces model's RMSE value to 393 and increases accuracy to 93%. The proposed model is capable to forecast traffic up to 6-hours in advance. Evacuation traffic management officials can use the developed traffic prediction model to anticipate future traffic congestion in advance and take proactive measures to reduce delays during evacuation.
    A Knowledge Distillation Approach for Sepsis Outcome Prediction from Multivariate Clinical Time Series. (arXiv:2311.09566v1 [cs.LG])
    Sepsis is a life-threatening condition triggered by an extreme infection response. Our objective is to forecast sepsis patient outcomes using their medical history and treatments, while learning interpretable state representations to assess patients' risks in developing various adverse outcomes. While neural networks excel in outcome prediction, their limited interpretability remains a key issue. In this work, we use knowledge distillation via constrained variational inference to distill the knowledge of a powerful "teacher" neural network model with high predictive power to train a "student" latent variable model to learn interpretable hidden state representations to achieve high predictive performance for sepsis outcome prediction. Using real-world data from the MIMIC-IV database, we trained an LSTM as the "teacher" model to predict mortality for sepsis patients, given information about their recent history of vital signs, lab values and treatments. For our student model, we use an autoregressive hidden Markov model (AR-HMM) to learn interpretable hidden states from patients' clinical time series, and use the posterior distribution of the learned state representations to predict various downstream outcomes, including hospital mortality, pulmonary edema, need for diuretics, dialysis, and mechanical ventilation. Our results show that our approach successfully incorporates the constraint to achieve high predictive power similar to the teacher model, while maintaining the generative performance.
    Synthetically Enhanced: Unveiling Synthetic Data's Potential in Medical Imaging Research. (arXiv:2311.09402v1 [cs.CV])
    Chest X-rays (CXR) are the most common medical imaging study and are used to diagnose multiple medical conditions. This study examines the impact of synthetic data supplementation, using diffusion models, on the performance of deep learning (DL) classifiers for CXR analysis. We employed three datasets: CheXpert, MIMIC-CXR, and Emory Chest X-ray, training conditional denoising diffusion probabilistic models (DDPMs) to generate synthetic frontal radiographs. Our approach ensured that synthetic images mirrored the demographic and pathological traits of the original data. Evaluating the classifiers' performance on internal and external datasets revealed that synthetic data supplementation enhances model accuracy, particularly in detecting less prevalent pathologies. Furthermore, models trained on synthetic data alone approached the performance of those trained on real data. This suggests that synthetic data can potentially compensate for real data shortages in training robust DL models. However, despite promising outcomes, the superiority of real data persists.  ( 2 min )
    Spatial Bayesian Neural Networks. (arXiv:2311.09491v1 [stat.ML])
    Statistical models for spatial processes play a central role in statistical analyses of spatial data. Yet, it is the simple, interpretable, and well understood models that are routinely employed even though, as is revealed through prior and posterior predictive checks, these can poorly characterise the spatial heterogeneity in the underlying process of interest. Here, we propose a new, flexible class of spatial-process models, which we refer to as spatial Bayesian neural networks (SBNNs). An SBNN leverages the representational capacity of a Bayesian neural network; it is tailored to a spatial setting by incorporating a spatial "embedding layer" into the network and, possibly, spatially-varying network parameters. An SBNN is calibrated by matching its finite-dimensional distribution at locations on a fine gridding of space to that of a target process of interest. That process could be easy to simulate from or we have many realisations from it. We propose several variants of SBNNs, most of which are able to match the finite-dimensional distribution of the target process at the selected grid better than conventional BNNs of similar complexity. We also show that a single SBNN can be used to represent a variety of spatial processes often used in practice, such as Gaussian processes and lognormal processes. We briefly discuss the tools that could be used to make inference with SBNNs, and we conclude with a discussion of their advantages and limitations.  ( 2 min )
    Impact of Spatial Frequency Based Constraints on Adversarial Robustness. (arXiv:2104.12679v3 [cs.LG] UPDATED)
    Adversarial examples mainly exploit changes to input pixels to which humans are not sensitive to, and arise from the fact that models make decisions based on uninterpretable features. Interestingly, cognitive science reports that the process of interpretability for human classification decision relies predominantly on low spatial frequency components. In this paper, we investigate the robustness to adversarial perturbations of models enforced during training to leverage information corresponding to different spatial frequency ranges. We show that it is tightly linked to the spatial frequency characteristics of the data at stake. Indeed, depending on the data set, the same constraint may results in very different level of robustness (up to 0.41 adversarial accuracy difference). To explain this phenomenon, we conduct several experiments to enlighten influential factors such as the level of sensitivity to high frequencies, and the transferability of adversarial perturbations between original and low-pass filtered inputs.
    Zenkai -- Framework For Exploring Beyond Backpropagation. (arXiv:2311.09663v1 [cs.LG])
    Zenkai is an open-source framework designed to give researchers more control and flexibility over building and training deep learning machines. It does this by dividing the deep learning machine into layers of semi-autonomous learning machines with their own target and learning algorithm. This is to allow researchers greater exploration such as the use of non-differentiable layers or learning algorithms beyond those based on error backpropagation. Backpropagation Rumelhart et al. [1986] has powered deep learning to become one of the most exciting fields of the 21st century. As a result, a large number of software tools have been developed to support efficient implementation and training of neural networks through the use of backpropa- gation. While these have been critical to the success of deep learning, building frameworks around backpropagation can make it challenging to implement solutions that do not adhere to it. Zenkai aims to make it easier to get around these limitations and help researchers more easily explore new frontiers in deep learning that do not strictly adhere to the backpropagation framework.  ( 2 min )
    Exploring the Privacy-Energy Consumption Tradeoff for Split Federated Learning. (arXiv:2311.09441v1 [cs.LG])
    Split Federated Learning (SFL) has recently emerged as a promising distributed learning technology, leveraging the strengths of both federated learning and split learning. It emphasizes the advantages of rapid convergence while addressing privacy concerns. As a result, this innovation has received significant attention from both industry and academia. However, since the model is split at a specific layer, known as a cut layer, into both client-side and server-side models for the SFL, the choice of the cut layer in SFL can have a substantial impact on the energy consumption of clients and their privacy, as it influences the training burden and the output of the client-side models. Moreover, the design challenge of determining the cut layer is highly intricate, primarily due to the inherent heterogeneity in the computing and networking capabilities of clients. In this article, we provide a comprehensive overview of the SFL process and conduct a thorough analysis of energy consumption and privacy. This analysis takes into account the influence of various system parameters on the cut layer selection strategy. Additionally, we provide an illustrative example of the cut layer selection, aiming to minimize the risk of clients from reconstructing the raw data at the server while sustaining energy consumption within the required energy budget, which involve trade-offs. Finally, we address open challenges in this field including their applications to 6G technology. These directions represent promising avenues for future research and development.  ( 3 min )
    Nondestructive, quantitative viability analysis of 3D tissue cultures using machine learning image segmentation. (arXiv:2311.09354v1 [q-bio.QM])
    Ascertaining the collective viability of cells in different cell culture conditions has typically relied on averaging colorimetric indicators and is often reported out in simple binary readouts. Recent research has combined viability assessment techniques with image-based deep-learning models to automate the characterization of cellular properties. However, further development of viability measurements to assess the continuity of possible cellular states and responses to perturbation across cell culture conditions is needed. In this work, we demonstrate an image processing algorithm for quantifying cellular viability in 3D cultures without the need for assay-based indicators. We show that our algorithm performs similarly to a pair of human experts in whole-well images over a range of days and culture matrix compositions. To demonstrate potential utility, we perform a longitudinal study investigating the impact of a known therapeutic on pancreatic cancer spheroids. Using images taken with a high content imaging system, the algorithm successfully tracks viability at the individual spheroid and whole-well level. The method we propose reduces analysis time by 97% in comparison to the experts. Because the method is independent of the microscope or imaging system used, this approach lays the foundation for accelerating progress in and for improving the robustness and reproducibility of 3D culture analysis across biological and clinical research.  ( 3 min )
    Labeled Interactive Topic Models. (arXiv:2311.09438v1 [cs.LG])
    Topic models help users understand large document collections; however, topic models do not always find the ``right'' topics. While classical probabilistic and anchor-based topic models have interactive variants to guide models toward better topics, such interactions are not available for neural topic models such as the embedded topic model (\abr{etm}). We correct this lacuna by adding an intuitive interaction to neural topic models: users can label a topic with a word, and topics are updated so that the topic words are close to the label. This allows a user to refine topics based on their information need. While, interactivity is intuitive for \abr{etm}, we extend this framework to work with other neural topic models as well. We develop an interactive interface which allows users to interact and relabel topic models as they see fit. We evaluate our method through a human study, where users can relabel topics to find relevant documents. Using our method, user labeling improves document rank scores, helping to find more relevant documents to a given query when compared to no user labeling.  ( 2 min )
    Nothing Stands Still: A Spatiotemporal Benchmark on 3D Point Cloud Registration Under Large Geometric and Temporal Change. (arXiv:2311.09346v1 [cs.CV])
    Building 3D geometric maps of man-made spaces is a well-established and active field that is fundamental to computer vision and robotics. However, considering the evolving nature of built environments, it is essential to question the capabilities of current mapping efforts in handling temporal changes. In addition, spatiotemporal mapping holds significant potential for achieving sustainability and circularity goals. Existing mapping approaches focus on small changes, such as object relocation or self-driving car operation; in all cases where the main structure of the scene remains fixed. Consequently, these approaches fail to address more radical changes in the structure of the built environment, such as geometry and topology. To this end, we introduce the Nothing Stands Still (NSS) benchmark, which focuses on the spatiotemporal registration of 3D scenes undergoing large spatial and temporal change, ultimately creating one coherent spatiotemporal map. Specifically, the benchmark involves registering two or more partial 3D point clouds (fragments) from the same scene but captured from different spatiotemporal views. In addition to the standard pairwise registration, we assess the multi-way registration of multiple fragments that belong to any temporal stage. As part of NSS, we introduce a dataset of 3D point clouds recurrently captured in large-scale building indoor environments that are under construction or renovation. The NSS benchmark presents three scenarios of increasing difficulty, to quantify the generalization ability of point cloud registration methods over space (within one building and across buildings) and time. We conduct extensive evaluations of state-of-the-art methods on NSS. The results demonstrate the necessity for novel methods specifically designed to handle large spatiotemporal changes. The homepage of our benchmark is at this http URL
    Long-form Question Answering: An Iterative Planning-Retrieval-Generation Approach. (arXiv:2311.09383v1 [cs.CL])
    Long-form question answering (LFQA) poses a challenge as it involves generating detailed answers in the form of paragraphs, which go beyond simple yes/no responses or short factual answers. While existing QA models excel in questions with concise answers, LFQA requires handling multiple topics and their intricate relationships, demanding comprehensive explanations. Previous attempts at LFQA focused on generating long-form answers by utilizing relevant contexts from a corpus, relying solely on the question itself. However, they overlooked the possibility that the question alone might not provide sufficient information to identify the relevant contexts. Additionally, generating detailed long-form answers often entails aggregating knowledge from diverse sources. To address these limitations, we propose an LFQA model with iterative Planning, Retrieval, and Generation. This iterative process continues until a complete answer is generated for the given question. From an extensive experiment on both an open domain and a technical domain QA dataset, we find that our model outperforms the state-of-the-art models on various textual and factual metrics for the LFQA task.  ( 2 min )
    Strategic Data Augmentation with CTGAN for Smart Manufacturing: Enhancing Machine Learning Predictions of Paper Breaks in Pulp-and-Paper Production. (arXiv:2311.09333v1 [cs.LG])
    A significant challenge for predictive maintenance in the pulp-and-paper industry is the infrequency of paper breaks during the production process. In this article, operational data is analyzed from a paper manufacturing machine in which paper breaks are relatively rare but have a high economic impact. Utilizing a dataset comprising 18,398 instances derived from a quality assurance protocol, we address the scarcity of break events (124 cases) that pose a challenge for machine learning predictive models. With the help of Conditional Generative Adversarial Networks (CTGAN) and Synthetic Minority Oversampling Technique (SMOTE), we implement a novel data augmentation framework. This method ensures that the synthetic data mirrors the distribution of the real operational data but also seeks to enhance the performance metrics of predictive modeling. Before and after the data augmentation, we evaluate three different machine learning algorithms-Decision Trees (DT), Random Forest (RF), and Logistic Regression (LR). Utilizing the CTGAN-enhanced dataset, our study achieved significant improvements in predictive maintenance performance metrics. The efficacy of CTGAN in addressing data scarcity was evident, with the models' detection of machine breaks (Class 1) improving by over 30% for Decision Trees, 20% for Random Forest, and nearly 90% for Logistic Regression. With this methodological advancement, this study contributes to industrial quality control and maintenance scheduling by addressing rare event prediction in manufacturing processes.
    Challenges for Predictive Modeling with Neural Network Techniques using Error-Prone Dietary Intake Data. (arXiv:2311.09338v1 [cs.LG])
    Dietary intake data are routinely drawn upon to explore diet-health relationships. However, these data are often subject to measurement error, distorting the true relationships. Beyond measurement error, there are likely complex synergistic and sometimes antagonistic interactions between different dietary components, complicating the relationships between diet and health outcomes. Flexible models are required to capture the nuance that these complex interactions introduce. This complexity makes research on diet-health relationships an appealing candidate for the application of machine learning techniques, and in particular, neural networks. Neural networks are computational models that are able to capture highly complex, nonlinear relationships so long as sufficient data are available. While these models have been applied in many domains, the impacts of measurement error on the performance of predictive modeling has not been systematically investigated. However, dietary intake data are typically collected using self-report methods and are prone to large amounts of measurement error. In this work, we demonstrate the ways in which measurement error erodes the performance of neural networks, and illustrate the care that is required for leveraging these models in the presence of error. We demonstrate the role that sample size and replicate measurements play on model performance, indicate a motivation for the investigation of transformations to additivity, and illustrate the caution required to prevent model overfitting. While the past performance of neural networks across various domains make them an attractive candidate for examining diet-health relationships, our work demonstrates that substantial care and further methodological development are both required to observe increased predictive performance when applying these techniques, compared to more traditional statistical procedures.
    Straggler-resilient Federated Learning: Tackling Computation Heterogeneity with Layer-wise Partial Model Training in Mobile Edge Network. (arXiv:2311.10002v1 [cs.LG])
    Federated Learning (FL) enables many resource-limited devices to train a model collaboratively without data sharing. However, many existing works focus on model-homogeneous FL, where the global and local models are the same size, ignoring the inherently heterogeneous computational capabilities of different devices and restricting resource-constrained devices from contributing to FL. In this paper, we consider model-heterogeneous FL and propose Federated Partial Model Training (FedPMT), where devices with smaller computational capabilities work on partial models (subsets of the global model) and contribute to the global model. Different from Dropout-based partial model generation, which removes neurons in hidden layers at random, model training in FedPMT is achieved from the back-propagation perspective. As such, all devices in FedPMT prioritize the most crucial parts of the global model. Theoretical analysis shows that the proposed partial model training design has a similar convergence rate to the widely adopted Federated Averaging (FedAvg) algorithm, $\mathcal{O}(1/T)$, with the sub-optimality gap enlarged by a constant factor related to the model splitting design in FedPMT. Empirical results show that FedPMT significantly outperforms the existing benchmark FedDrop. Meanwhile, compared to the popular model-homogeneous benchmark, FedAvg, FedPMT reaches the learning target in a shorter completion time, thus achieving a better trade-off between learning accuracy and completion time.
    Emerging Drug Interaction Prediction Enabled by Flow-based Graph Neural Network with Biomedical Network. (arXiv:2311.09261v1 [q-bio.QM])
    Accurately predicting drug-drug interactions (DDI) for emerging drugs, which offer possibilities for treating and alleviating diseases, with computational methods can improve patient care and contribute to efficient drug development. However, many existing computational methods require large amounts of known DDI information, which is scarce for emerging drugs. In this paper, we propose EmerGNN, a graph neural network (GNN) that can effectively predict interactions for emerging drugs by leveraging the rich information in biomedical networks. EmerGNN learns pairwise representations of drugs by extracting the paths between drug pairs, propagating information from one drug to the other, and incorporating the relevant biomedical concepts on the paths. The different edges on the biomedical network are weighted to indicate the relevance for the target DDI prediction. Overall, EmerGNN has higher accuracy than existing approaches in predicting interactions for emerging drugs and can identify the most relevant information on the biomedical network.  ( 2 min )
    Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks. (arXiv:2311.09247v1 [cs.AI])
    We explore the abstract reasoning abilities of text-only and multimodal versions of GPT-4, using the ConceptARC benchmark [10], which is designed to evaluate robust understanding and reasoning with core-knowledge concepts. We extend the work of Moskvichev et al. [10] by evaluating GPT-4 on more detailed, one-shot prompting (rather than simple, zero-shot prompts) with text versions of ConceptARC tasks, and by evaluating GPT-4V, the multimodal version of GPT-4, on zero- and one-shot prompts using image versions of the simplest tasks. Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.  ( 2 min )
    HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs. (arXiv:2311.09774v1 [cs.CL])
    Adapting a language model into a specific domain, a.k.a `domain adaption', is a common practice when specialized knowledge, e.g. medicine, is not encapsulated in a general language model like Llama2. The challenge lies in the heterogeneity of data across the two training stages, as it varies in languages, genres, or formats. To tackle this and simplify the learning protocol, we propose to transform heterogeneous data, from the both pre-training and supervised stages, into a unified, simple input-output pair format. We validate the new protocol in the domains where proprietary LLMs like ChatGPT perform relatively poorly, such as Traditional Chinese Medicine. The developed model, HuatuoGPT-II, has shown state-of-the-art performance in Chinese medicine domain on a number of benchmarks, e.g. medical licensing exams. It even outperforms proprietary models like ChatGPT and GPT-4 in some aspects, especially in Traditional Chinese Medicine. Expert manual evaluations further validate HuatuoGPT-II's advantages over existing LLMs. Notably, HuatuoGPT-II was benchmarked in a fresh Chinese National Medical Licensing Examination where it achieved the best performance, showcasing not only its effectiveness but also its generalization capabilities.  ( 2 min )
    Fast multiplication by two's complement addition of numbers represented as a set of polynomial radix 2 indexes, stored as an integer list for massively parallel computation. (arXiv:2311.09922v1 [cs.MS])
    We demonstrate a multiplication method based on numbers represented as set of polynomial radix 2 indices stored as an integer list. The 'polynomial integer index multiplication' method is a set of algorithms implemented in python code. We demonstrate the method to be faster than both the Number Theoretic Transform (NTT) and Karatsuba for multiplication within a certain bit range. Also implemented in python code for comparison purposes with the polynomial radix 2 integer method. We demonstrate that it is possible to express any integer or real number as a list of integer indices, representing a finite series in base two. The finite series of integer index representation of a number can then be stored and distributed across multiple CPUs / GPUs. We show that operations of addition and multiplication can be applied as two's complement additions operating on the index integer representations and can be fully distributed across a given CPU / GPU architecture. We demonstrate fully distributed arithmetic operations such that the 'polynomial integer index multiplication' method overcomes the current limitation of parallel multiplication methods. Ie, the need to share common core memory and common disk for the calculation of results and intermediate results.
    Generating Drug Repurposing Hypotheses through the Combination of Disease-Specific Hypergraphs. (arXiv:2311.09596v1 [q-bio.BM])
    The drug development pipeline for a new compound can last 10-20 years and cost over 10 billion. Drug repurposing offers a more time- and cost-effective alternative. Computational approaches based on biomedical knowledge graph representations have recently yielded new drug repurposing hypotheses. In this study, we present a novel, disease-specific hypergraph representation learning technique to derive contextual embeddings of biological pathways of various lengths but that all start at any given drug and all end at the disease of interest. Further, we extend this method to multi-disease hypergraphs. To determine the repurposing potential of each of the 1,522 drugs, we derive drug-specific distributions of cosine similarity values and ultimately consider the median for ranking. Cosine similarity values are computed between (1) all biological pathways starting at the considered drug and ending at the disease of interest and (2) all biological pathways starting at drugs currently prescribed against that disease and ending at the disease of interest. We illustrate our approach with Alzheimer's disease (AD) and two of its risk factors: hypertension (HTN) and type 2 diabetes (T2D). We compare each drug's rank across four hypergraph settings (single- or multi-disease): AD only, AD + HTN, AD + T2D, and AD + HTN + T2D. Notably, our framework led to the identification of two promising drugs whose repurposing potential was significantly higher in hypergraphs combining two diseases: dapagliflozin (antidiabetic; moved up, from top 32$\%$ to top 7$\%$, across all considered drugs) and debrisoquine (antihypertensive; moved up, from top 76$\%$ to top 23$\%$). Our approach serves as a hypothesis generation tool, to be paired with a validation pipeline relying on laboratory experiments and semi-automated parsing of the biomedical literature.
    Ulixes: Facial Recognition Privacy with Adversarial Machine Learning. (arXiv:2010.10242v2 [cs.CV] CROSS LISTED)
    Facial recognition tools are becoming exceptionally accurate in identifying people from images. However, this comes at the cost of privacy for users of online services with photo management (e.g. social media platforms). Particularly troubling is the ability to leverage unsupervised learning to recognize faces even when the user has not labeled their images. In this paper we propose Ulixes, a strategy to generate visually non-invasive facial noise masks that yield adversarial examples, preventing the formation of identifiable user clusters in the embedding space of facial encoders. This is applicable even when a user is unmasked and labeled images are available online. We demonstrate the effectiveness of Ulixes by showing that various classification and clustering methods cannot reliably label the adversarial examples we generate. We also study the effects of Ulixes in various black-box settings and compare it to the current state of the art in adversarial machine learning. Finally, we challenge the effectiveness of Ulixes against adversarially trained models and show that it is robust to countermeasures.
    Natural Disaster Analysis using Satellite Imagery and Social-Media Data for Emergency Response Situations. (arXiv:2311.09947v1 [cs.LG])
    Disaster Management is one of the most promising research areas because of its significant economic, environmental and social repercussions. This research focuses on analyzing different types of data (pre and post satellite images and twitter data) related to disaster management for in-depth analysis of location-wise emergency requirements. This research has been divided into two stages, namely, satellite image analysis and twitter data analysis followed by integration using location. The first stage involves pre and post disaster satellite image analysis of the location using multi-class land cover segmentation technique based on U-Net architecture. The second stage focuses on mapping the region with essential information about the disaster situation and immediate requirements for relief operations. The severely affected regions are demarcated and twitter data is extracted using keywords respective to that location. The extraction of situational information from a large corpus of raw tweets adopts Content Word based Tweet Summarization (COWTS) technique. An integration of these modules using real-time location-based mapping and frequency analysis technique gathers multi-dimensional information in the advent of disaster occurrence such as the Kerala and Mississippi floods that were analyzed and validated as test cases. The novelty of this research lies in the application of segmented satellite images for disaster relief using highlighted land cover changes and integration of twitter data by mapping these region-specific filters for obtaining a complete overview of the disaster.  ( 2 min )
    Graph-Guided Reasoning for Multi-Hop Question Answering in Large Language Models. (arXiv:2311.09762v1 [cs.CL])
    Chain-of-Thought (CoT) prompting has boosted the multi-step reasoning capabilities of Large Language Models (LLMs) by generating a series of rationales before the final answer. We analyze the reasoning paths generated by CoT and find two issues in multi-step reasoning: (i) Generating rationales irrelevant to the question, (ii) Unable to compose subquestions or queries for generating/retrieving all the relevant information. To address them, we propose a graph-guided CoT prompting method, which guides the LLMs to reach the correct answer with graph representation/verification steps. Specifically, we first leverage LLMs to construct a "question/rationale graph" by using knowledge extraction prompting given the initial question and the rationales generated in the previous steps. Then, the graph verification step diagnoses the current rationale triplet by comparing it with the existing question/rationale graph to filter out irrelevant rationales and generate follow-up questions to obtain relevant information. Additionally, we generate CoT paths that exclude the extracted graph information to represent the context information missed from the graph extraction. Our graph-guided reasoning method shows superior performance compared to previous CoT prompting and the variants on multi-hop question answering benchmark datasets.
    Scalable Diffusion for Materials Generation. (arXiv:2311.09235v1 [cs.LG])
    Generative models trained on internet-scale data are capable of generating novel and realistic texts, images, and videos. A natural next question is whether these models can advance science, for example by generating novel stable materials. Traditionally, models with explicit structures (e.g., graphs) have been used in modeling structural relationships in scientific data (e.g., atoms and bonds in crystals), but generating structures can be difficult to scale to large and complex systems. Another challenge in generating materials is the mismatch between standard generative modeling metrics and downstream applications. For instance, common metrics such as the reconstruction error do not correlate well with the downstream goal of discovering stable materials. In this work, we tackle the scalability challenge by developing a unified crystal representation that can represent any crystal structure (UniMat), followed by training a diffusion probabilistic model on these UniMat representations. Our empirical results suggest that despite the lack of explicit structure modeling, UniMat can generate high fidelity crystal structures from larger and more complex chemical systems, outperforming previous graph-based approaches under various generative modeling metrics. To better connect the generation quality of materials to downstream applications, such as discovering novel stable materials, we propose additional metrics for evaluating generative models of materials, including per-composition formation energy and stability with respect to convex hulls through decomposition energy from Density Function Theory (DFT). Lastly, we show that conditional generation with UniMat can scale to previously established crystal datasets with up to millions of crystals structures, outperforming random structure search (the current leading method for structure discovery) in discovering new stable materials.  ( 3 min )
    CAPCODRE: A Computational Systems Biology and Machine Learning-Based Approach to Predict Cognitive Disorder Risk in the Elderly. (arXiv:2311.09229v1 [q-bio.NC])
    As global life expectancy improves, the population of the elderly, persons that are aged 65 years and older, is steadily increasing as well. However, with aging populations a greater prevalence of cognitive impairment has emerged, ranging from mild dementia to severe dementias such as Alzheimer's disease due to genetic and environmental influences, among others. The purpose of this research was to develop a computational algorithm to predict the risk of developing cognitive disorders using a dual machine learning and systems biology approach. The proposed method CAPCODRE (Computational Approach to Predict COgnitive Disorder Risk for the Elderly) utilized air, water, and noise environmental pollution data coupled with a gene-protein interaction network, in addition to cognitive impairment hospitalizations in the United States to create a tailorable, interactive network able to predict risk of dementia and Alzheimer's disease. This network was inputted into a random selection optimization algorithm to select optimal training parameters for training via k-nearest neighbors, random forest regression, and decision trees. CAPCODRE was successfully able to predict and model risk of cognitive health issues through measures of specificity, sensitivity, and accuracy of >85%. The algorithm was integrated into an app for users to receive personalized predictions based on their medical history and geographic location. CAPCODRE can point to the extent of environmental pollution on human health and reveal steps to mitigate risk of severe cognitive impairment. This research also has the potential to address racial disparities in cognitive disorder diagnoses and treatment, promoting more equitable and accessible care.  ( 3 min )
    The Cybersecurity Crisis of Artificial Intelligence: Unrestrained Adoption and Natural Language-Based Attacks. (arXiv:2311.09224v1 [cs.CY])
    The widespread integration of autoregressive-large language models (AR-LLMs), such as ChatGPT, across established applications, like search engines, has introduced critical vulnerabilities with uniquely scalable characteristics. In this commentary, we analyse these vulnerabilities, their dependence on natural language as a vector of attack, and their challenges to cybersecurity best practices. We offer recommendations designed to mitigate these challenges.  ( 2 min )
    An Innovative Tool for Uploading/Scraping Large Image Datasets on Social Networks. (arXiv:2311.09237v1 [cs.DL])
    Nowadays, people can retrieve and share digital information in an increasingly easy and fast fashion through the well-known digital platforms, including sensitive data, inappropriate or illegal content, and, in general, information that might serve as probative evidence in court. Consequently, to assess forensics issues, we need to figure out how to trace back to the posting chain of a digital evidence (e.g., a picture, an audio) throughout the involved platforms -- this is what Digital (also Forensics) Ballistics basically deals with. With the entry of Machine Learning as a tool of the trade in many research areas, the need for vast amounts of data has been dramatically increasing over the last few years. However, collecting or simply find the "right" datasets that properly enables data-driven research studies can turn out to be not trivial in some cases, if not extremely challenging, especially when it comes with highly specialized tasks, such as creating datasets analyzed to detect the source media platform of a given digital media. In this paper we propose an automated approach by means of a digital tool that we created on purpose. The tool is capable of automatically uploading an entire image dataset to the desired digital platform and then downloading all the uploaded pictures, thus shortening the overall time required to output the final dataset to be analyzed.  ( 3 min )
    Cross-domain feature disentanglement for interpretable modeling of tumor microenvironment impact on drug response. (arXiv:2311.09264v1 [cs.LG])
    High-throughput screening technology has facilitated the generation of large-scale drug responses across hundreds of cancer cell lines. However, there exists significant discrepancy between in vitro cell lines and actual tumors in vivo in terms of their response to drug treatments, because of tumors comprise of complex cellular compositions and histopathology structure, known as tumor microenvironment (TME), which greatly influences the drug cytotoxicity against tumor cells. To date, no study has focused on modeling the impact of the TME on clinical drug response. This paper proposed a domain adaptation network for feature disentanglement to separate representations of cancer cells and TME of a tumor in patients. Two denoising autoencoders were separately used to extract features from cell lines (source domain) and tumors (target domain) for partial domain alignment and feature decoupling. The specific encoder was enforced to extract information only about TME. Moreover, to ensure generalizability to novel drugs, we applied a graph attention network to learn the latent representation of drugs, allowing us to linearly model the drug perturbation on cellular state in latent space. We calibrated our model on a benchmark dataset and demonstrated its superior performance in predicting clinical drug response and dissecting the influence of the TME on drug efficacy.  ( 2 min )
    The Perception-Robustness Tradeoff in Deterministic Image Restoration. (arXiv:2311.09253v1 [eess.IV])
    We study the behavior of deterministic methods for solving inverse problems in imaging. These methods are commonly designed to achieve two goals: (1) attaining high perceptual quality, and (2) generating reconstructions that are consistent with the measurements. We provide a rigorous proof that the better a predictor satisfies these two requirements, the larger its Lipschitz constant must be, regardless of the nature of the degradation involved. In particular, to approach perfect perceptual quality and perfect consistency, the Lipschitz constant of the model must grow to infinity. This implies that such methods are necessarily more susceptible to adversarial attacks. We demonstrate our theory on single image super-resolution algorithms, addressing both noisy and noiseless settings. We also show how this undesired behavior can be leveraged to explore the posterior distribution, thereby allowing the deterministic model to imitate stochastic methods.  ( 2 min )
    SegMix: A Simple Structure-Aware Data Augmentation Method. (arXiv:2311.09505v1 [cs.CL])
    Interpolation-based Data Augmentation (DA) methods (Mixup) linearly interpolate the inputs and labels of two or more training examples. Mixup has more recently been adapted to the field of Natural Language Processing (NLP), mainly for sequence labeling tasks. However, such a simple adoption yields mixed or unstable improvements over the baseline models. We argue that the direct-adoption methods do not account for structures in NLP tasks. To this end, we propose SegMix, a collection of interpolation-based DA algorithms that can adapt to task-specific structures. SegMix poses fewer constraints on data structures, is robust to various hyperparameter settings, applies to more task settings, and adds little computational overhead. In the algorithm's core, we apply interpolation methods on task-specific meaningful segments, in contrast to applying them on sequences as in prior work. We find SegMix to be a flexible framework that combines rule-based DA methods with interpolation-based methods, creating interesting mixtures of DA techniques. We show that SegMix consistently improves performance over strong baseline models in Named Entity Recognition (NER) and Relation Extraction (RE) tasks, especially under data-scarce settings. Furthermore, this method is easy to implement and adds negligible training overhead.
    DeepEMD: A Transformer-based Fast Estimation of the Earth Mover's Distance. (arXiv:2311.09998v1 [cs.LG])
    The Earth Mover's Distance (EMD) is the measure of choice between point clouds. However the computational cost to compute it makes it prohibitive as a training loss, and the standard approach is to use a surrogate such as the Chamfer distance. We propose an attention-based model to compute an accurate approximation of the EMD that can be used as a training loss for generative models. To get the necessary accurate estimation of the gradients we train our model to explicitly compute the matching between point clouds instead of EMD itself. We cast this new objective as the estimation of an attention matrix that approximates the ground truth matching matrix. Experiments show that this model provides an accurate estimate of the EMD and its gradient with a wall clock speed-up of more than two orders of magnitude with respect to the exact Hungarian matching algorithm and one order of magnitude with respect to the standard approximate Sinkhorn algorithm, allowing in particular to train a point cloud VAE with the EMD itself. Extensive evaluation show the remarkable behaviour of this model when operating out-of-distribution, a key requirement for a distance surrogate. Finally, the model generalizes very well to point clouds during inference several times larger than during training.
    LymphoML: An interpretable artificial intelligence-based method identifies morphologic features that correlate with lymphoma subtype. (arXiv:2311.09574v1 [cs.LG])
    The accurate classification of lymphoma subtypes using hematoxylin and eosin (H&E)-stained tissue is complicated by the wide range of morphological features these cancers can exhibit. We present LymphoML - an interpretable machine learning method that identifies morphologic features that correlate with lymphoma subtypes. Our method applies steps to process H&E-stained tissue microarray cores, segment nuclei and cells, compute features encompassing morphology, texture, and architecture, and train gradient-boosted models to make diagnostic predictions. LymphoML's interpretable models, developed on a limited volume of H&E-stained tissue, achieve non-inferior diagnostic accuracy to pathologists using whole-slide images and outperform black box deep-learning on a dataset of 670 cases from Guatemala spanning 8 lymphoma subtypes. Using SHapley Additive exPlanation (SHAP) analysis, we assess the impact of each feature on model prediction and find that nuclear shape features are most discriminative for DLBCL (F1-score: 78.7%) and classical Hodgkin lymphoma (F1-score: 74.5%). Finally, we provide the first demonstration that a model combining features from H&E-stained tissue with features from a standardized panel of 6 immunostains results in a similar diagnostic accuracy (85.3%) to a 46-stain panel (86.1%).
    Chain of Images for Intuitively Reasoning. (arXiv:2311.09241v1 [cs.CV])
    The human brain is naturally equipped to comprehend and interpret visual information rapidly. When confronted with complex problems or concepts, we use flowcharts, sketches, and diagrams to aid our thought process. Leveraging this inherent ability can significantly enhance logical reasoning. However, current Large Language Models (LLMs) do not utilize such visual intuition to help their thinking. Even the most advanced version language models (e.g., GPT-4V and LLaVA) merely align images into textual space, which means their reasoning processes remain purely verbal. To mitigate such limitations, we present a Chain of Images (CoI) approach, which can convert complex language reasoning problems to simple pattern recognition by generating a series of images as intermediate representations. Furthermore, we have developed a CoI evaluation dataset encompassing 15 distinct domains where images can intuitively aid problem-solving. Based on this dataset, we aim to construct a benchmark to assess the capability of future multimodal large-scale models to leverage images for reasoning. In supporting our CoI reasoning, we introduce a symbolic multimodal large language model (SyMLLM) that generates images strictly based on language instructions and accepts both text and image as input. Experiments on Geometry, Chess and Common Sense tasks sourced from the CoI evaluation dataset show that CoI improves performance significantly over the pure-language Chain of Thoughts (CoT) baselines. The code is available at https://github.com/GraphPKU/CoI.  ( 2 min )
    Affine Invariance in Continuous-Domain Convolutional Neural Networks. (arXiv:2311.09245v1 [cs.LG])
    The notion of group invariance helps neural networks in recognizing patterns and features under geometric transformations. Indeed, it has been shown that group invariance can largely improve deep learning performances in practice, where such transformations are very common. This research studies affine invariance on continuous-domain convolutional neural networks. Despite other research considering isometric invariance or similarity invariance, we focus on the full structure of affine transforms generated by the generalized linear group $\mathrm{GL}_2(\mathbb{R})$. We introduce a new criterion to assess the similarity of two input signals under affine transformations. Then, unlike conventional methods that involve solving complex optimization problems on the Lie group $G_2$, we analyze the convolution of lifted signals and compute the corresponding integration over $G_2$. In sum, our research could eventually extend the scope of geometrical transformations that practical deep-learning pipelines can handle.  ( 2 min )
    Auto-ICL: In-Context Learning without Human Supervision. (arXiv:2311.09263v1 [cs.LG])
    In the era of Large Language Models (LLMs), human-computer interaction has evolved towards natural language, offering unprecedented flexibility. Despite this, LLMs are heavily reliant on well-structured prompts to function efficiently within the realm of In-Context Learning. Vanilla In-Context Learning relies on human-provided contexts, such as labeled examples, explicit instructions, or other guiding mechanisms that shape the model's outputs. To address this challenge, our study presents a universal framework named Automatic In-Context Learning. Upon receiving a user's request, we ask the model to independently generate examples, including labels, instructions, or reasoning pathways. The model then leverages this self-produced context to tackle the given problem. Our approach is universally adaptable and can be implemented in any setting where vanilla In-Context Learning is applicable. We demonstrate that our method yields strong performance across a range of tasks, standing up well when compared to existing methods.  ( 2 min )
    A Simple and Powerful Framework for Stable Dynamic Network Embedding. (arXiv:2311.09251v1 [cs.SI])
    In this paper, we address the problem of dynamic network embedding, that is, representing the nodes of a dynamic network as evolving vectors within a low-dimensional space. While the field of static network embedding is wide and established, the field of dynamic network embedding is comparatively in its infancy. We propose that a wide class of established static network embedding methods can be used to produce interpretable and powerful dynamic network embeddings when they are applied to the dilated unfolded adjacency matrix. We provide a theoretical guarantee that, regardless of embedding dimension, these unfolded methods will produce stable embeddings, meaning that nodes with identical latent behaviour will be exchangeable, regardless of their position in time or space. We additionally define a hypothesis testing framework which can be used to evaluate the quality of a dynamic network embedding by testing for planted structure in simulated networks. Using this, we demonstrate that, even in trivial cases, unstable methods are often either conservative or encode incorrect structure. In contrast, we demonstrate that our suite of stable unfolded methods are not only more interpretable but also more powerful in comparison to their unstable counterparts.  ( 2 min )
    Neural Packing: from Visual Sensing to Reinforcement Learning. (arXiv:2311.09233v1 [cs.LG])
    We present a novel learning framework to solve the transport-and-packing (TAP) problem in 3D. It constitutes a full solution pipeline from partial observations of input objects via RGBD sensing and recognition to final box placement, via robotic motion planning, to arrive at a compact packing in a target container. The technical core of our method is a neural network for TAP, trained via reinforcement learning (RL), to solve the NP-hard combinatorial optimization problem. Our network simultaneously selects an object to pack and determines the final packing location, based on a judicious encoding of the continuously evolving states of partially observed source objects and available spaces in the target container, using separate encoders both enabled with attention mechanisms. The encoded feature vectors are employed to compute the matching scores and feasibility masks of different pairings of box selection and available space configuration for packing strategy optimization. Extensive experiments, including ablation studies and physical packing execution by a real robot (Universal Robot UR5e), are conducted to evaluate our method in terms of its design choices, scalability, generalizability, and comparisons to baselines, including the most recent RL-based TAP solution. We also contribute the first benchmark for TAP which covers a variety of input settings and difficulty levels.  ( 2 min )
  • Open

    Modelling daily mobility using mobile data traffic at fine spatiotemporal scale. (arXiv:2311.09683v1 [cs.LG])
    We applied a data-driven approach that explores the usability of the NetMob 2023 dataset in modelling mobility patterns within an urban context. We combined the data with a highly suitable external source, the ENACT dataset, which provides a 1 km x 1km grid with estimates of the day and night population across Europe. We developed three sets of XGBoost models that predict the population in each 100m x 100m grid cell used in NetMob2023 based on the mobile data traffic of the 68 online services covered in the dataset, using the ENACT values as ground truth. The results suggest that the NetMob 2023 data can be useful for the estimation of the day and night population and grid cell level and can explain part of the dynamics of urban mobility.  ( 2 min )
    Improved Bayes Risk Can Yield Reduced Social Welfare Under Competition. (arXiv:2306.14670v2 [cs.GT] UPDATED)
    As the scale of machine learning models increases, trends such as scaling laws anticipate consistent downstream improvements in predictive accuracy. However, these trends take the perspective of a single model-provider in isolation, while in reality providers often compete with each other for users. In this work, we demonstrate that competition can fundamentally alter the behavior of these scaling trends, even causing overall predictive accuracy across users to be non-monotonic or decreasing with scale. We define a model of competition for classification tasks, and use data representations as a lens for studying the impact of increases in scale. We find many settings where improving data representation quality (as measured by Bayes risk) decreases the overall predictive accuracy across users (i.e., social welfare) for a marketplace of competing model-providers. Our examples range from closed-form formulas in simple settings to simulations with pretrained representations on CIFAR-10. At a conceptual level, our work suggests that favorable scaling trends for individual model-providers need not translate to downstream improvements in social welfare in marketplaces with multiple model providers.  ( 2 min )
    Conditional Matrix Flows for Gaussian Graphical Models. (arXiv:2306.07255v2 [cs.LG] UPDATED)
    Studying conditional independence among many variables with few observations is a challenging task. Gaussian Graphical Models (GGMs) tackle this problem by encouraging sparsity in the precision matrix through $l_q$ regularization with $q\leq1$. However, most GMMs rely on the $l_1$ norm because the objective is highly non-convex for sub-$l_1$ pseudo-norms. In the frequentist formulation, the $l_1$ norm relaxation provides the solution path as a function of the shrinkage parameter $\lambda$. In the Bayesian formulation, sparsity is instead encouraged through a Laplace prior, but posterior inference for different $\lambda$ requires repeated runs of expensive Gibbs samplers. Here we propose a general framework for variational inference with matrix-variate Normalizing Flow in GGMs, which unifies the benefits of frequentist and Bayesian frameworks. As a key improvement on previous work, we train with one flow a continuum of sparse regression models jointly for all regularization parameters $\lambda$ and all $l_q$ norms, including non-convex sub-$l_1$ pseudo-norms. Within one model we thus have access to (i) the evolution of the posterior for any $\lambda$ and any $l_q$ (pseudo-) norm, (ii) the marginal log-likelihood for model selection, and (iii) the frequentist solution paths through simulated annealing in the MAP limit.  ( 2 min )
    Open Problem: Learning with Variational Objectives on Measures. (arXiv:2306.11928v2 [stat.ML] UPDATED)
    The theory of statistical learning has focused on variational objectives expressed on functions. In this note, we discuss motivations to write similar objectives on measures, in particular to discuss out-of-distribution generalization and weakly-supervised learning. It raises a natural question: can one cast usual statistical learning results to objectives expressed on measures? Does the resulting construction lead to new algorithms of practical interest?  ( 2 min )
    Comparison of Methods that Combine Multiple Randomized Trials to Estimate Heterogeneous Treatment Effects. (arXiv:2303.16299v2 [stat.ME] UPDATED)
    Individualized treatment decisions can improve health outcomes, but using data to make these decisions in a reliable, precise, and generalizable way is challenging with a single dataset. Leveraging multiple randomized controlled trials allows for the combination of datasets with unconfounded treatment assignment to better estimate heterogeneous treatment effects. This paper discusses several non-parametric approaches for estimating heterogeneous treatment effects using data from multiple trials. We extend single-study methods to a scenario with multiple trials and explore their performance through a simulation study, with data generation scenarios that have differing levels of cross-trial heterogeneity. The simulations demonstrate that methods that directly allow for heterogeneity of the treatment effect across trials perform better than methods that do not, and that the choice of single-study method matters based on the functional form of the treatment effect. Finally, we discuss which methods perform well in each setting and then apply them to four randomized controlled trials to examine effect heterogeneity of treatments for major depressive disorder.  ( 2 min )
    Gradients Look Alike: Sensitivity is Often Overestimated in DP-SGD. (arXiv:2307.00310v2 [cs.LG] UPDATED)
    Differentially private stochastic gradient descent (DP-SGD) is the canonical approach to private deep learning. While the current privacy analysis of DP-SGD is known to be tight in some settings, several empirical results suggest that models trained on common benchmark datasets leak significantly less privacy for many datapoints. Yet, despite past attempts, a rigorous explanation for why this is the case has not been reached. Is it because there exist tighter privacy upper bounds when restricted to these dataset settings, or are our attacks not strong enough for certain datapoints? In this paper, we provide the first per-instance (i.e., ``data-dependent") DP analysis of DP-SGD. Our analysis captures the intuition that points with similar neighbors in the dataset enjoy better data-dependent privacy than outliers. Formally, this is done by modifying the per-step privacy analysis of DP-SGD to introduce a dependence on the distribution of model updates computed from a training dataset. We further develop a new composition theorem to effectively use this new per-step analysis to reason about an entire training run. Put all together, our evaluation shows that this novel DP-SGD analysis allows us to now formally show that DP-SGD leaks significantly less privacy for many datapoints (when trained on common benchmarks) than the current data-independent guarantee. This implies privacy attacks will necessarily fail against many datapoints if the adversary does not have sufficient control over the possible training datasets.  ( 3 min )
    Learning to Reconstruct Signals From Binary Measurements. (arXiv:2303.08691v3 [eess.SP] UPDATED)
    Recent advances in unsupervised learning have highlighted the possibility of learning to reconstruct signals from noisy and incomplete linear measurements alone. These methods play a key role in medical and scientific imaging and sensing, where ground truth data is often scarce or difficult to obtain. However, in practice, measurements are not only noisy and incomplete but also quantized. Here we explore the extreme case of learning from binary observations and provide necessary and sufficient conditions on the number of measurements required for identifying a set of signals from incomplete binary data. Our results are complementary to existing bounds on signal recovery from binary measurements. Furthermore, we introduce a novel self-supervised learning approach, which we name SSBM, that only requires binary data for training. We demonstrate in a series of experiments with real datasets that SSBM performs on par with supervised learning and outperforms sparse reconstruction methods with a fixed wavelet basis by a large margin.  ( 2 min )
    Investigating the Corruption Robustness of Image Classifiers with Random Lp-norm Corruptions. (arXiv:2305.05400v2 [cs.LG] UPDATED)
    Robustness is a fundamental property of machine learning classifiers to achieve safety and reliability. In the fields of adversarial robustness and formal robustness verification of image classification models, robustness is commonly defined as the stability to all input variations within an Lp-norm distance. However, robustness to random corruptions is usually improved and evaluated using variations observed in the real-world, while mathematically defined Lp-norm corruptions are rarely considered. This study investigates the use of random Lp-norm corruptions to augment the training and test data of image classifiers. We adapt an approach from the field of adversarial robustness to assess the model robustness to imperceptible random corruptions. We empirically and theoretically investigate whether robustness is transferable across different Lp-norms and derive conclusions on which Lp-norm corruptions a model should be trained and evaluated on. We find that training data augmentation with L0-norm corruptions improves corruption robustness while maintaining accuracy compared to standard training and when applied on top of selected state-of-the-art data augmentation techniques.  ( 2 min )
    Fair Data Representation for Machine Learning at the Pareto Frontier. (arXiv:2201.00292v3 [stat.ML] UPDATED)
    As machine learning powered decision-making becomes increasingly important in our daily lives, it is imperative to strive for fairness in the underlying data processing. We propose a pre-processing algorithm for fair data representation via which supervised learning results in estimations of the Pareto frontier between prediction error and statistical disparity. Particularly, the present work applies the optimal affine transport to approach the post-processing Wasserstein-2 barycenter characterization of the optimal fair $L^2$-objective supervised learning via a pre-processing data deformation. Furthermore, we show that the Wasserstein-2 geodesics from the conditional (on sensitive information) distributions of the learning outcome to their barycenter characterizes the Pareto frontier between $L^2$-loss and the average pairwise Wasserstein-2 distance among sensitive groups on the learning outcome. Numerical simulations underscore the advantages: (1) the pre-processing step is compositive with arbitrary conditional expectation estimation supervised learning methods and unseen data; (2) the fair representation protects the sensitive information by limiting the inference capability of the remaining data with respect to the sensitive data; (3) the optimal affine maps are computationally efficient even for high-dimensional data.  ( 2 min )
    Private estimation algorithms for stochastic block models and mixture models. (arXiv:2301.04822v2 [cs.DS] UPDATED)
    We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an $(\epsilon, \delta)$-differentially private algorithm that recovers the centers of the $k$-mixture when the minimum separation is at least $ O(k^{1/t}\sqrt{t})$. For all choices of $t$, this algorithm requires sample complexity $n\geq k^{O(1)}d^{O(t)}$ and time complexity $(nd)^{O(t)}$. Prior work required minimum separation at least $O(\sqrt{k})$ as well as an explicit upper bound on the Euclidean norm of the centers.  ( 2 min )
    Show Your Work with Confidence: Confidence Bands for Tuning Curves. (arXiv:2311.09480v1 [cs.CL])
    The choice of hyperparameters greatly impacts performance in natural language processing. Often, it is hard to tell if a method is better than another or just better tuned. Tuning curves fix this ambiguity by accounting for tuning effort. Specifically, they plot validation performance as a function of the number of hyperparameter choices tried so far. While several estimators exist for these curves, it is common to use point estimates, which we show fail silently and give contradictory results when given too little data. Beyond point estimates, confidence bands are necessary to rigorously establish the relationship between different approaches. We present the first method to construct valid confidence bands for tuning curves. The bands are exact, simultaneous, and distribution-free, thus they provide a robust basis for comparing methods. Empirical analysis shows that while bootstrap confidence bands, which serve as a baseline, fail to approximate their target confidence, ours achieve it exactly. We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method. To promote confident comparisons in future work, we release a library implementing the method at https://github.com/nalourie/opda .  ( 2 min )
    Score-based generative models learn manifold-like structures with constrained mixing. (arXiv:2311.09952v1 [stat.ML])
    How do score-based generative models (SBMs) learn the data distribution supported on a low-dimensional manifold? We investigate the score model of a trained SBM through its linear approximations and subspaces spanned by local feature vectors. During diffusion as the noise decreases, the local dimensionality increases and becomes more varied between different sample sequences. Importantly, we find that the learned vector field mixes samples by a non-conservative field within the manifold, although it denoises with normal projections as if there is an energy function in off-manifold directions. At each noise level, the subspace spanned by the local features overlap with an effective density function. These observations suggest that SBMs can flexibly mix samples with the learned score field while carefully maintaining a manifold-like structure of the data distribution.  ( 2 min )
    Representational Strengths and Limitations of Transformers. (arXiv:2306.02896v2 [cs.LG] UPDATED)
    Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both positive and negative results on the representation power of attention layers, with a focus on intrinsic complexity parameters such as width, depth, and embedding dimension. On the positive side, we present a sparse averaging task, where recurrent networks and feedforward networks all have complexity scaling polynomially in the input size, whereas transformers scale merely logarithmically in the input size; furthermore, we use the same construction to show the necessity and role of a large embedding dimension in a transformer. On the negative side, we present a triple detection task, where attention layers in turn have complexity scaling linearly in the input size; as this scenario seems rare in practice, we also present natural variants that can be efficiently solved by attention layers. The proof techniques emphasize the value of communication complexity in the analysis of transformers and related models, and the role of sparse averaging as a prototypical attention task, which even finds use in the analysis of triple detection.  ( 2 min )
    Categorical Distributions of Maximum Entropy under Marginal Constraints. (arXiv:2204.03406v2 [hep-th] UPDATED)
    The estimation of categorical distributions under marginal constraints summarizing some sample from a population in the most-generalizable way is key for many machine-learning and data-driven approaches. We provide a parameter-agnostic theoretical framework that enables this task ensuring (i) that a categorical distribution of Maximum Entropy under marginal constraints always exists and (ii) that it is unique. The procedure of iterative proportional fitting (IPF) naturally estimates that distribution from any consistent set of marginal constraints directly in the space of probabilities, thus deductively identifying a least-biased characterization of the population. The theoretical framework together with IPF leads to a holistic workflow that enables modeling any class of categorical distributions solely using the phenomenological information provided.  ( 2 min )
    Spatial Bayesian Neural Networks. (arXiv:2311.09491v1 [stat.ML])
    Statistical models for spatial processes play a central role in statistical analyses of spatial data. Yet, it is the simple, interpretable, and well understood models that are routinely employed even though, as is revealed through prior and posterior predictive checks, these can poorly characterise the spatial heterogeneity in the underlying process of interest. Here, we propose a new, flexible class of spatial-process models, which we refer to as spatial Bayesian neural networks (SBNNs). An SBNN leverages the representational capacity of a Bayesian neural network; it is tailored to a spatial setting by incorporating a spatial "embedding layer" into the network and, possibly, spatially-varying network parameters. An SBNN is calibrated by matching its finite-dimensional distribution at locations on a fine gridding of space to that of a target process of interest. That process could be easy to simulate from or we have many realisations from it. We propose several variants of SBNNs, most of which are able to match the finite-dimensional distribution of the target process at the selected grid better than conventional BNNs of similar complexity. We also show that a single SBNN can be used to represent a variety of spatial processes often used in practice, such as Gaussian processes and lognormal processes. We briefly discuss the tools that could be used to make inference with SBNNs, and we conclude with a discussion of their advantages and limitations.  ( 2 min )
    Adversarial Examples Exist in Two-Layer ReLU Networks for Low Dimensional Linear Subspaces. (arXiv:2303.00783v2 [cs.LG] UPDATED)
    Despite a great deal of research, it is still not well-understood why trained neural networks are highly vulnerable to adversarial examples. In this work we focus on two-layer neural networks trained using data which lie on a low dimensional linear subspace. We show that standard gradient methods lead to non-robust neural networks, namely, networks which have large gradients in directions orthogonal to the data subspace, and are susceptible to small adversarial $L_2$-perturbations in these directions. Moreover, we show that decreasing the initialization scale of the training algorithm, or adding $L_2$ regularization, can make the trained network more robust to adversarial perturbations orthogonal to the data.  ( 2 min )
    Time-dependent Probabilistic Generative Models for Disease Progression. (arXiv:2311.09369v1 [stat.ML])
    Electronic health records contain valuable information for monitoring patients' health trajectories over time. Disease progression models have been developed to understand the underlying patterns and dynamics of diseases using these data as sequences. However, analyzing temporal data from EHRs is challenging due to the variability and irregularities present in medical records. We propose a Markovian generative model of treatments developed to (i) model the irregular time intervals between medical events; (ii) classify treatments into subtypes based on the patient sequence of medical events and the time intervals between them; and (iii) segment treatments into subsequences of disease progression patterns. We assume that sequences have an associated structure of latent variables: a latent class representing the different subtypes of treatments; and a set of latent stages indicating the phase of progression of the treatments. We use the Expectation-Maximization algorithm to learn the model, which is efficiently solved with a dynamic programming-based method. Various parametric models have been employed to model the time intervals between medical events during the learning process, including the geometric, exponential, and Weibull distributions. The results demonstrate the effectiveness of our model in recovering the underlying model from data and accurately modeling the irregular time intervals between medical actions.  ( 2 min )
    Accelerating material discovery with a threshold-driven hybrid acquisition policy-based Bayesian optimization. (arXiv:2311.09591v1 [cs.LG])
    Advancements in materials play a crucial role in technological progress. However, the process of discovering and developing materials with desired properties is often impeded by substantial experimental costs, extensive resource utilization, and lengthy development periods. To address these challenges, modern approaches often employ machine learning (ML) techniques such as Bayesian Optimization (BO), which streamline the search for optimal materials by iteratively selecting experiments that are most likely to yield beneficial results. However, traditional BO methods, while beneficial, often struggle with balancing the trade-off between exploration and exploitation, leading to sub-optimal performance in material discovery processes. This paper introduces a novel Threshold-Driven UCB-EI Bayesian Optimization (TDUE-BO) method, which dynamically integrates the strengths of Upper Confidence Bound (UCB) and Expected Improvement (EI) acquisition functions to optimize the material discovery process. Unlike the classical BO, our method focuses on efficiently navigating the high-dimensional material design space (MDS). TDUE-BO begins with an exploration-focused UCB approach, ensuring a comprehensive initial sweep of the MDS. As the model gains confidence, indicated by reduced uncertainty, it transitions to the more exploitative EI method, focusing on promising areas identified earlier. The UCB-to-EI switching policy dictated guided through continuous monitoring of the model uncertainty during each step of sequential sampling results in navigating through the MDS more efficiently while ensuring rapid convergence. The effectiveness of TDUE-BO is demonstrated through its application on three different material datasets, showing significantly better approximation and optimization performance over the EI and UCB-based BO methods in terms of the RMSE scores and convergence efficiency, respectively.  ( 3 min )
    A Simple and Powerful Framework for Stable Dynamic Network Embedding. (arXiv:2311.09251v1 [cs.SI])
    In this paper, we address the problem of dynamic network embedding, that is, representing the nodes of a dynamic network as evolving vectors within a low-dimensional space. While the field of static network embedding is wide and established, the field of dynamic network embedding is comparatively in its infancy. We propose that a wide class of established static network embedding methods can be used to produce interpretable and powerful dynamic network embeddings when they are applied to the dilated unfolded adjacency matrix. We provide a theoretical guarantee that, regardless of embedding dimension, these unfolded methods will produce stable embeddings, meaning that nodes with identical latent behaviour will be exchangeable, regardless of their position in time or space. We additionally define a hypothesis testing framework which can be used to evaluate the quality of a dynamic network embedding by testing for planted structure in simulated networks. Using this, we demonstrate that, even in trivial cases, unstable methods are often either conservative or encode incorrect structure. In contrast, we demonstrate that our suite of stable unfolded methods are not only more interpretable but also more powerful in comparison to their unstable counterparts.  ( 2 min )
    Urban traffic congestion control: a DeePC change. (arXiv:2311.09851v1 [eess.SY])
    Urban traffic congestion remains a pressing challenge in our rapidly expanding cities, despite the abundance of available data and the efforts of policymakers. By leveraging behavioral system theory and data-driven control, this paper exploits the DeePC algorithm in the context of urban traffic control performed via dynamic traffic lights. To validate our approach, we consider a high-fidelity case study using the state-of-the-art simulation software package Simulation of Urban MObility (SUMO). Preliminary results indicate that DeePC outperforms existing approaches across various key metrics, including travel time and CO$_2$ emissions, demonstrating its potential for effective traffic management  ( 2 min )
    AMLB: an AutoML Benchmark. (arXiv:2207.12560v2 [cs.LG] UPDATED)
    Comparing different AutoML frameworks is notoriously challenging and often done incorrectly. We introduce an open and extensible benchmark that follows best practices and avoids common mistakes when comparing AutoML frameworks. We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures. We also use Bradley-Terry trees to discover subsets of tasks where the relative AutoML framework rankings differ. The benchmark comes with an open-source tool that integrates with many AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation. The benchmark uses public data sets, can be easily extended with other AutoML frameworks and tasks, and has a website with up-to-date results.  ( 2 min )
    Online Optimization for Network Resource Allocation and Comparison with Reinforcement Learning Techniques. (arXiv:2311.10023v1 [stat.ML])
    We tackle in this paper an online network resource allocation problem with job transfers. The network is composed of many servers connected by communication links. The system operates in discrete time; at each time slot, the administrator reserves resources at servers for future job requests, and a cost is incurred for the reservations made. Then, after receptions, the jobs may be transferred between the servers to best accommodate the demands. This incurs an additional transport cost. Finally, if a job request cannot be satisfied, there is a violation that engenders a cost to pay for the blocked job. We propose a randomized online algorithm based on the exponentially weighted method. We prove that our algorithm enjoys a sub-linear in time regret, which indicates that the algorithm is adapting and learning from its experiences and is becoming more efficient in its decision-making as it accumulates more data. Moreover, we test the performance of our algorithm on artificial data and compare it against a reinforcement learning method where we show that our proposed method outperforms the latter.  ( 2 min )
    Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-start. (arXiv:2202.03397v4 [stat.ML] UPDATED)
    We analyse a general class of bilevel problems, in which the upper-level problem consists in the minimization of a smooth objective function and the lower-level problem is to find the fixed point of a smooth contraction map. This type of problems include instances of meta-learning, equilibrium models, hyperparameter optimization and data poisoning adversarial attacks. Several recent works have proposed algorithms which warm-start the lower-level problem, i.e.~they use the previous lower-level approximate solution as a staring point for the lower-level solver. This warm-start procedure allows one to improve the sample complexity in both the stochastic and deterministic settings, achieving in some cases the order-wise optimal sample complexity. However, there are situations, e.g., meta learning and equilibrium models, in which the warm-start procedure is not well-suited or ineffective. In this work we show that without warm-start, it is still possible to achieve order-wise (near) optimal sample complexity. In particular, we propose a simple method which uses (stochastic) fixed point iterations at the lower-level and projected inexact gradient descent at the upper-level, that reaches an $\epsilon$-stationary point using $O(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-1})$ samples for the stochastic and the deterministic setting, respectively. Finally, compared to methods using warm-start, our approach yields a simpler analysis that does not need to study the coupled interactions between the upper-level and lower-level iterates.  ( 3 min )
    Co-data Learning for Bayesian Additive Regression Trees. (arXiv:2311.09997v1 [stat.ML])
    Medical prediction applications often need to deal with small sample sizes compared to the number of covariates. Such data pose problems for prediction and variable selection, especially when the covariate-response relationship is complicated. To address these challenges, we propose to incorporate co-data, i.e. external information on the covariates, into Bayesian additive regression trees (BART), a sum-of-trees prediction model that utilizes priors on the tree parameters to prevent overfitting. To incorporate co-data, an empirical Bayes (EB) framework is developed that estimates, assisted by a co-data model, prior covariate weights in the BART model. The proposed method can handle multiple types of co-data simultaneously. Furthermore, the proposed EB framework enables the estimation of the other hyperparameters of BART as well, rendering an appealing alternative to cross-validation. We show that the method finds relevant covariates and that it improves prediction compared to default BART in simulations. If the covariate-response relationship is nonlinear, the method benefits from the flexibility of BART to outperform regression-based co-data learners. Finally, the use of co-data enhances prediction in an application to diffuse large B-cell lymphoma prognosis based on clinical covariates, gene mutations, DNA translocations, and DNA copy number data. Keywords: Bayesian additive regression trees; Empirical Bayes; Co-data; High-dimensional data; Omics; Prediction  ( 2 min )
    Bayesian Learning via Q-Exponential Process. (arXiv:2210.07987v3 [stat.ME] UPDATED)
    Regularization is one of the most fundamental topics in optimization, statistics and machine learning. To get sparsity in estimating a parameter $u\in\mathbb{R}^d$, an $\ell_q$ penalty term, $\Vert u\Vert_q$, is usually added to the objective function. What is the probabilistic distribution corresponding to such $\ell_q$ penalty? What is the correct stochastic process corresponding to $\Vert u\Vert_q$ when we model functions $u\in L^q$? This is important for statistically modeling large dimensional objects, e.g. images, with penalty to preserve certainty properties, e.g. edges in the image. In this work, we generalize the $q$-exponential distribution (with density proportional to) $\exp{(- \frac{1}{2}|u|^q)}$ to a stochastic process named $Q$-exponential (Q-EP) process that corresponds to the $L_q$ regularization of functions. The key step is to specify consistent multivariate $q$-exponential distributions by choosing from a large family of elliptic contour distributions. The work is closely related to Besov process which is usually defined by the expanded series. Q-EP can be regarded as a definition of Besov process with explicit probabilistic formulation and direct control on the correlation length. From the Bayesian perspective, Q-EP provides a flexible prior on functions with sharper penalty ($q<2$) than the commonly used Gaussian process (GP). We compare GP, Besov and Q-EP in modeling functional data, reconstructing images, and solving inverse problems and demonstrate the advantage of our proposed methodology.  ( 2 min )
    Rethinking Fano's Inequality in Ensemble Learning. (arXiv:2205.12683v2 [cs.LG] UPDATED)
    We propose a fundamental theory on ensemble learning that answers the central question: what factors make an ensemble system good or bad? Previous studies used a variant of Fano's inequality of information theory and derived a lower bound of the classification error rate on the basis of the $\textit{accuracy}$ and $\textit{diversity}$ of models. We revisit the original Fano's inequality and argue that the studies did not take into account the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate the information loss, which we name $\textit{combination loss}$. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each metric, which will push the theoretical understanding of ensemble learning and give us insights into designing systems.  ( 2 min )
    Affine Invariance in Continuous-Domain Convolutional Neural Networks. (arXiv:2311.09245v1 [cs.LG])
    The notion of group invariance helps neural networks in recognizing patterns and features under geometric transformations. Indeed, it has been shown that group invariance can largely improve deep learning performances in practice, where such transformations are very common. This research studies affine invariance on continuous-domain convolutional neural networks. Despite other research considering isometric invariance or similarity invariance, we focus on the full structure of affine transforms generated by the generalized linear group $\mathrm{GL}_2(\mathbb{R})$. We introduce a new criterion to assess the similarity of two input signals under affine transformations. Then, unlike conventional methods that involve solving complex optimization problems on the Lie group $G_2$, we analyze the convolution of lifted signals and compute the corresponding integration over $G_2$. In sum, our research could eventually extend the scope of geometrical transformations that practical deep-learning pipelines can handle.  ( 2 min )
    Soft Matching Distance: A metric on neural representations that captures single-neuron tuning. (arXiv:2311.09466v1 [cs.LG])
    Common measures of neural representational (dis)similarity are designed to be insensitive to rotations and reflections of the neural activation space. Motivated by the premise that the tuning of individual units may be important, there has been recent interest in developing stricter notions of representational (dis)similarity that require neurons to be individually matched across networks. When two networks have the same size (i.e. same number of neurons), a distance metric can be formulated by optimizing over neuron index permutations to maximize tuning curve alignment. However, it is not clear how to generalize this metric to measure distances between networks with different sizes. Here, we leverage a connection to optimal transport theory to derive a natural generalization based on "soft" permutations. The resulting metric is symmetric, satisfies the triangle inequality, and can be interpreted as a Wasserstein distance between two empirical distributions. Further, our proposed metric avoids counter-intuitive outcomes suffered by alternative approaches, and captures complementary geometric insights into neural representations that are entirely missed by rotation-invariant metrics.  ( 2 min )

  • Open

    ASRock wants to make generative AI software easier to use on Radeon RX 7000 series with its new AI QuickSet app
    submitted by /u/Aaadvarke [link] [comments]
    Sam Altman fired as CEO of OpenAI
    Sam Altman has been fired as CEO of OpenAI, the company announced on Friday. “Mr. Altman’s departure follows a deliberative review process by the board, which concluded that he was not consistently candid in his communications with the board, hindering its ability to exercise its responsibilities,” the company said in its blog post. “The board no longer has confidence in his ability to continue leading OpenAI.” Chief technology officer Mira Murati will be the interim CEO, effective immediately. The company will be conducting a search for the permanent CEO successor. When contacted by The Verge, OpenAI’s communications department declined to comment beyond the blog post. Sources: https://www.theverge.com/2023/11/17/23965982/openai-ceo-sam-altman-fired submitted by /u/Excellent-Target-847 [link] [comments]
    Sam Altman fired as CEO of OpenAI
    Sam Altman has been fired as the CEO of OpenAI following a board review that questioned his candor in communications, with Mira Murati stepping in as interim CEO. submitted by /u/Remarkable_Ad9528 [link] [comments]
    OpenAI Co-Founder and Chief Scientist says that GPT's architecture, Transformers, can obviously get us to AGI
    Ilya Sutskever, Co-Founder and Chief Scientist at OpenAI, that developed ChatGPT, says that GPT's architecture, Transformers, can obviously get us to AGI. He also adds: We shouldn't don't think about it in terms of binary "is it enough", but "how much effort, what will be the cost of using this particular architecture"? Maybe some modification, can have enough computation efficiency benefits. Specialized brain regions are not fully hardcoded, but very adaptible and plastic. Human cortex is very uniform. You just need one big uniform architecture. Video form: https://twitter.com/burny_tech/status/1725578088392573038 Interviewer: One question I've heard people debate a little bit is the degree to which the Transformer based models can be applied to sort of the full set of areas that you'd…
    This ChatGPT Frontend Clone was Coded by GPT4 Vision
    submitted by /u/NuseAI [link] [comments]
    WebsiteAi X GetReach
    WebsiteAI x GetReach ​ GetReach offers an entire ecosystem of utilities within Discord. As we look to broaden our scope into Discord, we also look to integrate our website bot in Discord too. This goes hand in hand with what GetReach offers. ​ GetReach offers missions, very much like Shield in Telegram! Missions completed earns the user with ETH rewards to claim ✅ ​ GetReach is backed by some of the most influential Cryptopreneurs in the space, especially within the NFT scene; including CryptoPunks, Bored Ape Yacht Club etc🤟 ​ As GetReach extends its developments, we will be mirroring them hand in hand🤝 ​ We look forward to start our partnership with our first Co-Hosting Twitter Space early next week! 🕊 https://preview.redd.it/pn08tt0wky0c1.png?width=1280&format=png&auto=webp&s=efd2f49e2508e639e0cf7642851ea940a08c77a1 submitted by /u/balok1232 [link] [comments]
    Google Enhances Holiday Shopping with AI-Powered Features in Search Generative Experience Tool
    submitted by /u/Upbeat-Interaction13 [link] [comments]
    AI — weekly megathread!
    News provided by aibrews.com Meta AI introduces: Emu Video: new text-to-video model that leverages Meta’s Emu image generation model and can respond to text-only, image-only or combined text & image inputs to generate high quality video [Details]. Emu Edit: This new model is capable of free-form editing through text instructions. Emu Edit precisely follows instructions, ensuring that pixels in the input image unrelated to the instructions remain untouched [Details]. Researchers present LLaVA-Plus, a general-purpose multimodal assistant that expands the capabilities of large multimodal models. LLaVA-Plus maintains a skill repository that contains a wide range of vision and vision-language pre-trained models (tools), and is able to activate relevant tools, given users’ multimodal in…
    Wow
    submitted by /u/mravoid1 [link] [comments]
    Is there an AI that can help me not be mentally ill anymore?
    Or do i have to wait until they invent assisted suicide bots? Fml submitted by /u/ThrowRA21458910 [link] [comments]
    AI's take on schizophrenia
    submitted by /u/LightOfAntara [link] [comments]
    Augmented Rapper, can anybody help me to make this AI app? I got more specs
    submitted by /u/Niu_Davinci [link] [comments]
    "Deepfakes: A Double-Edged Sword in Creativity and Ethical Challenges"
    submitted by /u/Fit-Code-5141 [link] [comments]
    Looking for AI to remove backgrounds of images with context!
    Hello folks! I am looking for a way to mass remove backgrounds of images, using newer methods of AI. There are of course plenty of resources available that remove the backgrounds, but usually, these methods only work well for contrasting colors with clear defined edges. I have to remove bulk backgrounds of products on a white background and want to use AI, because some objects contain white (which is hard for the old-school algorithms to decipher) or are even transparent, and I need a reliable way to process thousands of images with high accuracy (>95%). From there I can define edge cases and figure out which product categories need a different method. One example of this that has worked well for me so far, is Photoshop 2024's select subject tool, with the dropdown "cloud-based" selected. This seems to analyze the image, and more "understand" the context for a proper selection, instead of the regular way where it calculated where an edge is. I wonder if I could use the API for Dall-E 3 for this, etc. I'm looking forward to hear your suggestions and if you can help me brainstorm! submitted by /u/hommelbips [link] [comments]
    Google AI outperforms traditional weather forecasting: Accurate predictions 10 days ahead without a supercomputer
    submitted by /u/Jariiari7 [link] [comments]
    One-Minute Daily AI News 11/16/2023
    Notion Unveils AI-Assisted Q&A Feature Starting At $8.[1] Google is embedding inaudible watermarks right into its AI generated music.[2] Microsoft Unveils Text-to-Speech Avatar Azure AI Speech.[3] Microsoft Unveils AI Chip That Could Replace Nvidia in Its Cloud.[4] Sources: [1] https://www.benzinga.com/news/23/11/35798312/notion-unveils-ai-assisted-q-a-feature [2] https://www.theverge.com/2023/11/16/23963607/google-deepmind-synthid-audio-watermarks [3] https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-speech-announces-public-preview-of-text-to-speech/ba-p/3981448 [4] https://www.barrons.com/articles/microsoft-ai-chip-maia-nvidia-f2b57a88 submitted by /u/Excellent-Target-847 [link] [comments]
    Seeing more and more prominent people now posting AI videos without realising they're fake. Truly the age of misinformation/disinformation
    submitted by /u/AnimalsChasingCars [link] [comments]
    Let's have some fun with Roast GPT.
    submitted by /u/aksh951357 [link] [comments]
    Suno AI auto-generated song about Stable Diffusion - SDXL
    submitted by /u/ptitrainvaloin [link] [comments]
  • Open

    Personal information in digital photos
    Is it possible to identify the people in the photo above? Maybe. Digital images potentially contain a large amount of metadata that could reveal the photographer’s identify and location. There may also be a surprising number of clues in the photo itself. EXIF metadata The standard format for image metadata is EXIF, Exchangeable Image File […] Personal information in digital photos first appeared on John D. Cook.  ( 6 min )
    What can you learn from a phone number?
    What can someone learn about you from your phone number? The answer depends on what other information someone has. Identifiers always depend on context. To a naked man in a tree [1] the phone number doesn’t carry any information. But to someone with a list of names and phone numbers, some sort of reverse phone […] What can you learn from a phone number? first appeared on John D. Cook.  ( 5 min )
  • Open

    [R] Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs
    Paper: https://arxiv.org/abs/2311.05657 Blog: https://allenai.github.io/lumos/ Code: https://github.com/allenai/lumos Models and data: https://huggingface.co/ai2lumos Abstract: We introduce Lumos, a novel framework for training language agents that employs a unified data format and a modular architecture based on open-source large language models (LLMs). Lumos consists of three distinct modules: planning, grounding, and execution. The planning module breaks down a task into a series of high-level, tool-agnostic subgoals, which are then made specific by the grounding module through a set of low-level actions. These actions are subsequently executed by the execution module, utilizing a range of off-the-shelf tools and APIs. In order to train these modules effectively, high-quality annotations of subgoals and actions were collected and are made available for fine-tuning open-source LLMs for various tasks such as complex question answering, web tasks, and math problems. Leveraging this unified data and modular design, Lumos not only achieves comparable or superior performance to current, state-of-the-art agents, but also exhibits several key advantages: (1) Lumos surpasses GPT-4/3.5-based agents in complex question answering and web tasks, while equalling the performance of significantly larger LLM agents on math tasks; (2) Lumos outperforms open-source agents created through conventional training methods and those using chain-of-thoughts training; and (3) Lumos is capable of effectively generalizing to unseen interactive tasks, outperforming larger LLM-based agents and even exceeding performance of specialized agents. https://preview.redd.it/igrcn7p8sz0c1.png?width=1097&format=png&auto=webp&s=96a9327131572e208eab07585c223c169fef7614 ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [P] Resources for learning CNNs offline?
    I'm getting on a long flight tomorrow, and I've been meaning to dig deep and get experience training CNNs myself or even implementing one. I have a lot of experience now with finetuning LLMs because nobody else could fill that role in my company this year, but I was just a SWE before, and I want to springboard off this into actual ML roles. So, does anyone have some solid notebooks or pdfs I can occupy myself with? And should I download a dataset from kaggle or something? Thanks submitted by /u/HPLaserJetM140we [link] [comments]  ( 9 min )
    [N] OpenAI Announces Leadership Transition, Fires Sam Altman
    Source: https://openai.com/blog/openai-announces-leadership-transition Today, it was announced that Sam Altman will no longer be CEO or affiliated with OpenAI due to a lack of “candidness” with the board. This is extremely unexpected as Sam Altman is arguably the most recognizable face of state of the art AI (of course, wouldn’t be possible without great team at OpenAI). Lots of speculation is in the air, but there clearly must have been some good reason to make such a drastic decision. This may or may not materially affect ML research, but it is plausible that the lack of “candidness” is related to copyright data, or usage of data sources that could land OpenAI in hot water with regulatory scrutiny. Recent lawsuits (https://www.reuters.com/legal/litigation/writers-suing-openai-fire-back-companys-copyright-defense-2023-09-28/) have raised questions about both the morality and legality of how OpenAI and other research groups train LLMs. Of course we may never know the true reasons behind this action, but what does this mean for the future of AI? submitted by /u/Sm0oth_kriminal [link] [comments]  ( 9 min )
    [P] Latent Consistency Models Open Source Art Generation API Server
    https://github.com/Netwrck/stable-diffusion-server Creating this to make it easy to create a server that makes AI Art. It uploads any created images to a google cloud storage bucket. Let me know what you think :) Its using SSD-1b latent consistency models and a pass with stable diffusion refiner which is getting crazy good results ​ https://preview.redd.it/f72ul7142z0c1.png?width=506&format=png&auto=webp&s=8f6434fb8e3d515637ef1ec94f7d30f49a29eb27 submitted by /u/ai_art_generater [link] [comments]  ( 9 min )
    [D] Tsetlin Machines, anyone heard of it?
    Hi, T have very little experience with Logic based Al algorithms, and my understanding is these machines use propositional logic algorithms (which wasn't something I have come across before, even when I was in uni). My experience is in statistical & neural based networks and can understand the pros and cons of those networks. I am just trying to understand what is the advantage and disadvantages of logic based algorithms specifically Tsetlin Machines? Is this something that I should learn more about? What are some good resources? Thanks submitted by /u/Loud-Consideration-2 [link] [comments]  ( 1 min )
    [Research] Any research in the area of similarity search for vector db's?
    I was wondering if anyone is aware of any current/ongoing research on similarity search for vector dbs? In the RAG models I have been working on, most of the accuracy of the result hinges on the similarity search returning the right document, and less on the language model. Thanks! submitted by /u/BakerInTheKitchen [link] [comments]  ( 9 min )
    [N] DeepMind now can predict the weather with less computing power than other approaches
    According to Nature ML methods are starting to outperform NWP without supercomputers https://www.nature.com/articles/d41586-023-03552-y and the coolest thing is that they already released the code https://github.com/google-deepmind/graphcast so I wonder how long would it take before we start seeing people pull data from home weather stations to run an hyperlocal fine-tuned model. submitted by /u/SCP_radiantpoison [link] [comments]  ( 9 min )
    [D] Supervized time-series segmentation from timestamp-based labels
    Hi, I have different audio time-series with timestamp-based annotations of multiple start + end segment boundaries that are denoting consecutive respiration cycles. Does anyone has an idea on how I could use supervized-learning approaches to automatically segment respiration cycles from audio time series ? Cheerz submitted by /u/deadalius [link] [comments]  ( 9 min )
    [P] Introducing Neural-Cherche: Enhance Document Retrieval with Advanced AI Models
    Hi everyone, I'm excited to share a tool I've developed called Neural-Cherche. Its main purpose is to transform a Sentence Transformer into a ColBERT model, which is currently at the forefront of information retrieval tools. What does Neural-Cherche do? In simple terms, it fine-tunes advanced AI models to make them more efficient at understanding and processing information. This results in a significant improvement in how effectively these models can find and retrieve relevant documents. This tool is particularly beneficial for those working with RAG (Retriever-Augmented Generation) pipelines. The ColBERT and Splade models, featured in SIGIR 2020 and 2021, have shown exceptional performance in document retrieval when they are fine-tuned, and Neural-Cherche makes this process easier and minimalist. Let me know what you think, and how it might be useful in your projects! submitted by /u/Ok-Cartoonist8114 [link] [comments]  ( 1 min )
    [P] Need help with project (Coreference Resolution using Deep Learning)
    Project: Coreference Resolution in Long Texts using Deep Learning I need help with understanding what I would need to study for this. It's not a research project so to say, but we are trying to do some kind of optimization for the accuracy in longer texts. We have to build a model and not just propose something new, i.e, if we are doing that. Right now I'm studying Linear Algebra by Gilbert Strang from MIT OCW. And plan to do Calculus (Single and Multivariate), Probability and Statistics after that. After this I am going to go for Machine Learning by Andrew Ng, NLP with Deep Learning by Christopher Manning and Natural Language Understanding by Christopher Potts. ​ This project has to be completed by April 2024 approximately. So I wanted to know if what I plan to do is enough/not enough/overkill or something else. Or if I should be taking a different path altogether. submitted by /u/burntToast1134 [link] [comments]  ( 9 min )
    [P] Versioning code & large models together in GitHub
    Hey r/MachineLearning! Last year, u/rajatarya showcased how we scaled Git to handle large datasets. One piece of feedback we kept getting is that people didn't want to move their source code over to XetHub. So we built a GitHub app & integration that lets you continue storing code in GitHub while XetHub handles the large datasets & models. https://about.xethub.com/blog/xetdata-scale-github-repos-100-tb We've enjoyed using it to host open source LLM's like Llama2 and Mistral with our finetuning code side-by-side. The whole thing is in beta so we're eager for any feedback you have to offer :) submitted by /u/semicausal [link] [comments]  ( 9 min )
    [R] Seeking input: transformer modification with 25-30% improvement in validation loss across 3 datasets
    Long time lurker here. Made an account just to post this. I've been experimenting with some modifications on the transformer architecture (addition of a new standalone component). Recently I got to something that seems to be an improvement of the validation loss by ~25-30% over vanilla decoder transformers; the task is next token prediction. My question is if this is significant enough to dedicate more serious effort in (eg. getting more compute credits to create a bigger model, running a beefier benchmark, sharing more with folks in academia to get feedback, write a paper, etc.) or if it's likely a fluke. In terms of methodology, I've compared vanilla-vs-modification on 3 datasets (in increasing difficulty): Penn Treebank, Lord of the Rings, the complete works of Shakespeare. The data…  ( 10 min )
    [D] How to choose weights for VotingRegressor?
    I have multiple models that are used in VotingRegressor i have used defaults weights like [1,1,2,1] is there any method, that gives optimal/best weights for each models? submitted by /u/_Killua_04 [link] [comments]  ( 9 min )
    [D] Hugging Face Models to AWS Sagemaker Endpoints: A MLOps Perspective
    I recently created this tutorial on how to go from a Hugging Face model to a AWS Sagemaker Endpoint in a production setting: https://www.zenml.io/blog/huggingface-to-sagemaker While I was making this project, it struck me how much of the tutorials around this sort of stuff really focuses on a one-time deployment, but IMO most of the problems really happen when you are doing this repeatably. That's why I've focused more on lineage, data tracking, and automated deployments in this post. What does everyone think: Is there a lack of best practices available to reliably do this sort of stuff? submitted by /u/htahir1 [link] [comments]  ( 9 min )
    Predicting optimal drug based on remission. [Research]
    I want to train a machine learning model that uses the following parameters to predict the optimal treatment choice. Say I have data on patient characteristics, imaging data, and which treatment was given to the patient (X, Y, Z), the outcome is complete remission. i want to note that for example if patient was given X and he went to remission then X is the drug of choice for that patient. If patient was give Y and it didnt work he can be given Z and if Z works then Z is the drug of choice for that patient. ​ The outcome here is not to predict the outcome, it is to predict the ideal treatment. The other way is I can predict outcome for each of the treatment choice. ​ How can one go about this? ​ ​ Thank you all submitted by /u/I_only_wanna_learn [link] [comments]  ( 9 min )
    [Discussion] Machine learning for predicting treatment regimine
    Hello community, ​ I am working on a machine learning project where the goal is to predict the optimal treatment choice for individual patients based on their characteristics, imaging data, and previous treatment outcomes. The outcome I'm interested in is achieving complete remission. ​ Here's a brief overview of the problem: ​ Input Features: ​ Patient Characteristics (e.g., age, gender, medical history) Imaging Data Previous Treatment Given (X, Y, Z) Output: ​ Optimal Treatment Choice for Complete Remission I want the model to learn from cases where a particular treatment led to remission and, if not, explore alternative treatments. ​ I'm considering two approaches: ​ Predicting Optimal Treatment Directly: ​ Training the model to directly predict the optimal treatment choice for a given patient based on their features. Predicting Treatment Outcome for Each Choice: ​ Training separate models for each treatment choice to predict the likelihood of complete remission. Questions: ​ Which approach do you think is more suitable for this problem? What kind of machine learning algorithms and techniques would you recommend for this task? How can I handle cases where the initial treatment doesn't lead to remission, and an alternative treatment is given? I appreciate any guidance, insights, or examples you can provide to help me structure and approach this problem effectively. Thank you! submitted by /u/ibrahimredit [link] [comments]
    [D] what are the latest trends /breakthroughs in low level architecture/operators composition?
    is there any novelty on architecture design/operators level during the last year? I was looking into recent SOTA models both for NLP and computer vision and seems it is mainly the same concept - attention with slightly different training approach / different number of same layers combined together. Is there any innovation on lower level happening? Ie any new promising architectures, design approaches or operators which were introduced lately? Thank you. submitted by /u/Grrr11 [link] [comments]  ( 9 min )
    [D] The Evolution of Serverless GPUs: In-Depth Performance & Cost Analysis for Llama 2 7Bn & Stable Diffusion 2-1 model across providers
    In the evolving landscape of AI Infrastructure, Serverless GPUs have been a game changer. Six months on from our last guide, which sparked multiple discussions & created more awareness about the space, we've returned with fresh insights on the state of "True Serverless" offerings and I am here sharing performance benchmark & cost effectiveness analysis for Llama 2-7Bn & Stable Diffusion 2-1 model. 📊 Performance Testing Methodology: We put the spotlight on popular serverless GPU contenders: Runpod, Replicate, Inferless, and Hugging Face Inference Endpoints, specifically testing for: 1. Cold Starts: Varied across platforms. Latency minus inference time, represents the delay due to initializing a dormant Serverless function. Average Cold-starts across platforms 2. Variability: We don't …  ( 10 min )
    [P] Obscure speech recording while keeping data useful for machine learning
    Is there a way to obscure speech recording so there is no way to play it and get something intelligible, but still keep it useful for machine learning? For my project I have to collect data in uncontrolled environment, and I would like to do it without accidentally storing sensitive information. It seems to be an uncommon problem, and I haven't found much. I am currently using spectrograms to extract features. For what I have found, making a spectrogram from a soundwave uses STFT and doesn't store phase information, so there is not enough information to perform the inverse transformation. Do I understand this correctly? What are other ways to do it? submitted by /u/vladisser [link] [comments]  ( 9 min )
    [R] Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models
    submitted by /u/hardmaru [link] [comments]  ( 9 min )
    [R] Neural MMO 2.0: A Massively Multi-task Addition to Massively Multi-agent Learning
    Paper: https://arxiv.org/abs/2311.03736 Project page: https://neuralmmo.github.io GitHub: https://github.com/neuralmmo Abstract: Neural MMO 2.0 is a massively multi-agent environment for reinforcement learning research. The key feature of this new version is a flexible task system that allows users to define a broad range of objectives and reward signals. We challenge researchers to train agents capable of generalizing to tasks, maps, and opponents never seen during training. Neural MMO features procedurally generated maps with 128 agents in the standard setting and support for up to. Version 2.0 is a complete rewrite of its predecessor with three-fold improved performance and compatibility with CleanRL. We release the platform as free and open-source software with comprehensive documentation available at this http URL and an active community Discord. To spark initial research on this new platform, we are concurrently running a competition at NeurIPS 2023. https://preview.redd.it/8pvecr133t0c1.png?width=901&format=png&auto=webp&s=f9bf7f063c1b9950d40ad2a3b04bdc7f0780413f ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
  • Open

    Can AI or GPT-4 Answer my $1m Question?
    Imagine if GPT-4 could tell traders how to play on the stock market. For instance, using insider information gathered on the Internet, it could produce winning buy and sell signals, with target prices, for various stocks. Then, anyone could use simple English, or even an API to get the right predictions and trade automatically. Would… Read More »Can AI or GPT-4 Answer my $1m Question? The post Can AI or GPT-4 Answer my $1m Question? appeared first on Data Science Central.  ( 23 min )
  • Open

    Emerging practices for Society-Centered AI
    Posted by Anoop Sinha, Research Director, Technology & Society, and Yossi Matias, Vice President, Google Research The first of Google’s AI Principles is to “Be socially beneficial.” As AI practitioners, we’re inspired by the transformative potential of AI technologies to benefit society and our shared environment at a scale and swiftness that wasn’t possible before. From helping address the climate crisis to helping transform healthcare, to making the digital world more accessible, our goal is to apply AI responsibly to be helpful to more people around the globe. Achieving global scale requires researchers and communities to think ahead — and act — collectively across the AI ecosystem. We call this approach Society-Centered AI. It is both an extension and an expansion of Human-Cente…  ( 93 min )
  • Open

    Skeleton-of-Thought: Parallel decoding speeds up and improves LLM output
    Large language models (LLMs) such as LLaMA and OpenAI’s GPT-4 are revolutionizing technology. However, one of the common complaints about LLMs is their speed, or lack thereof. In many cases, it takes a long time to get an answer from them. This limits LLMs’ applications and their usefulness in latency-critical functions, such as chatbots, copilots, […] The post Skeleton-of-Thought: Parallel decoding speeds up and improves LLM output appeared first on Microsoft Research.  ( 12 min )
  • Open

    From Guangzhou to Los Angeles, Automakers Dazzle With AI-Powered Vehicles
    Good news for car lovers: Two acclaimed auto shows, taking place now through next week, are delighting attendees with displays of next-generation automotive designs powered by AI. Hundreds of thousands of auto enthusiasts worldwide are expected to visit Guangzhou, China — known as the city of flowers — to attend its auto show, running through Read article >  ( 6 min )
    AI to See ‘Major Second Wave,’ NVIDIA CEO Says in Fireside Chat With iliad Group Exec
    European startups will get a massive boost from a new generation of AI infrastructure, NVIDIA founder and CEO Jensen Huang said Friday in a fireside chat with iliad Group Deputy CEO Aude Durand — and it’s coming just in time. “We’re now seeing a major second wave,” Huang said of the state of AI during Read article >  ( 7 min )
    NVIDIA and Scaleway Speed Development for European Startups and Enterprises
    Europe’s startup ecosystem is getting a boost of accelerated computing for generative AI. NVIDIA and cloud service provider (CSP) Scaleway are working together to deliver access to GPUs, NVIDIA AI Enterprise software, and services for turbocharging large language models (LLMs) and generative AI development for European startups. Scaleway, a subsidiary of French telecommunications provider iliad Read article >  ( 6 min )
  • Open

    Moderate your Amazon IVS live stream using Amazon Rekognition
    Amazon Interactive Video Service (Amazon IVS) is a managed live streaming solution that is designed to provide a quick and straightforward setup to let you build interactive video experiences and handles interactive video content from ingestion to delivery. With the increased usage of live streaming, the need for effective content moderation becomes even more crucial. […]  ( 9 min )
    Retrieval-Augmented Generation with LangChain, Amazon SageMaker JumpStart, and MongoDB Atlas semantic search
    Generative AI models have the potential to revolutionize enterprise operations, but businesses must carefully consider how to harness their power while overcoming challenges such as safeguarding data and ensuring the quality of AI-generated content. The Retrieval-Augmented Generation (RAG) framework augments prompts with external data from multiple sources, such as document repositories, databases, or APIs, to […]  ( 6 min )
    Build a foundation model (FM) powered customer service bot with agents for Amazon Bedrock
    From enhancing the conversational experience to agent assistance, there are plenty of ways that generative artificial intelligence (AI) and foundation models (FMs) can help deliver faster, better support. With the increasing availability and diversity of FMs, it’s difficult to experiment and keep up-to-date with the latest model versions. Amazon Bedrock is a fully managed service […]  ( 7 min )
  • Open

    OpenAI announces leadership transition
    Chief technology officer Mira Murati appointed interim CEO to lead OpenAI; Sam Altman departs the company. Search process underway to identify permanent successor.  ( 2 min )
  • Open

    Can ordinary backpropagation networks be used for time series data, if temporally successive data values are entered into successive network input neurons?
    Question: If one is going to use a neural network for future stock market price prediction, where you have a time series of stock price values, can an ordinary backpropagation network be used for this purpose? I am think of taking the stock price every 15 minutes, over a period of say 3 days (so that there will be 3 x 24 x 4 = 288 price values), and then entering these values (in normalised form) into a backpropagation network with 288 input neurons. This network setup would then be used to predict what will happen to the stock price a few hours into the future. Would this setup work as effectively as a long short-term memory (LSTM) network in terms of future price prediction? Or do LSTM networks offer something over and above ordinary backpropagation networks? submitted by /u/GA64 [link] [comments]  ( 9 min )
    What Meta learned from Galactica, the doomed model launched two weeks before ChatGPT
    submitted by /u/nickb [link] [comments]  ( 1 min )
  • Open

    How would one go about creating a reward function from raw-pixel input?
    I played around a bit with OpenAI Gym. It returns the state as RGB frames and the reward function is pre-defined by the environment somehow (I have no idea how - for the OpenAI Gym Mario environment I used for example it was the distance along the x-axis). How was this distance obtained? In any case; I will ask my actual question now. Say I am trying to make a game-playing agent for arbitrary game X which may be 2D or 3D. How would one go about making a game playing agent for a complex game like Call of Duty: World War II which have large maps & missions that are not straight-foward etc. the "reward" might be received over longer periods in a not-so-obvious manner. Any literature review around this? submitted by /u/universecoder [link] [comments]
    "Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero", Schut et al 2023 {DM} (identifying concepts in superhuman chess engines that give rise to a plan)
    submitted by /u/gwern [link] [comments]
    Hosting RL Challenges on Kaggle [D]
    submitted by /u/Brown_bagheera [link] [comments]
  • Open

    MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models. (arXiv:2305.19011v3 [eess.AS] UPDATED)
    SUPERB was proposed to evaluate the generalizability of self-supervised learning (SSL) speech models across various tasks. However, it incurs high computational costs due to the large datasets and diverse tasks. In this paper, we introduce MiniSUPERB, a lightweight benchmark that efficiently evaluates SSL speech models with comparable results to SUPERB but lower computational costs significantly. We carefully select representative tasks, sample datasets, and extract model representations offline. Our approach achieves a Spearman's rank correlation of 0.954 and 0.982 with SUPERB Paper and SUPERB Challenge, respectively. Additionally, we reduce the computational cost by 97% in terms of Multiply-ACcumulate operations (MACs). Furthermore, we evaluate SSL speech models in few-shot scenarios and observe significant variations in their performance. To our knowledge, this is the first study to examine both the computational cost of the model itself and the cost of evaluating it on a benchmark.  ( 2 min )
    GenHPF: General Healthcare Predictive Framework with Multi-task Multi-source Learning. (arXiv:2207.09858v3 [cs.LG] UPDATED)
    Despite the remarkable progress in the development of predictive models for healthcare, applying these algorithms on a large scale has been challenging. Algorithms trained on a particular task, based on specific data formats available in a set of medical records, tend to not generalize well to other tasks or databases in which the data fields may differ. To address this challenge, we propose General Healthcare Predictive Framework (GenHPF), which is applicable to any EHR with minimal preprocessing for multiple prediction tasks. GenHPF resolves heterogeneity in medical codes and schemas by converting EHRs into a hierarchical textual representation while incorporating as many features as possible. To evaluate the efficacy of GenHPF, we conduct multi-task learning experiments with single-source and multi-source settings, on three publicly available EHR datasets with different schemas for 12 clinically meaningful prediction tasks. Our framework significantly outperforms baseline models that utilize domain knowledge in multi-source learning, improving average AUROC by 1.2%P in pooled learning and 2.6%P in transfer learning while also showing comparable results when trained on a single EHR dataset. Furthermore, we demonstrate that self-supervised pretraining using multi-source datasets is effective when combined with GenHPF, resulting in a 0.6%P AUROC improvement compared to models without pretraining. By eliminating the need for preprocessing and feature engineering, we believe that this work offers a solid framework for multi-task and multi-source learning that can be leveraged to speed up the scaling and usage of predictive algorithms in healthcare.  ( 3 min )
    Is Channel Independent strategy optimal for Time Series Forecasting?. (arXiv:2310.17658v2 [cs.LG] UPDATED)
    There has been an emergence of various models for long-term time series forecasting. Recent studies have demonstrated that a single linear layer, using Channel Dependent (CD) or Channel Independent (CI) modeling, can even outperform a large number of sophisticated models. However, current research primarily considers CD and CI as two complementary yet mutually exclusive approaches, unable to harness these two extremes simultaneously. And it is also a challenging issue that both CD and CI are static strategies that cannot be determined to be optimal for a specific dataset without extensive experiments. In this paper, we reconsider whether the current CI strategy is the best solution for time series forecasting. First, we propose a simple yet effective strategy called CSC, which stands for $\mathbf{C}$hannel $\mathbf{S}$elf-$\mathbf{C}$lustering strategy, for linear models. Our Channel Self-Clustering (CSC) enhances CI strategy's performance improvements while reducing parameter size, for exmpale by over 10 times on electricity dataset, and significantly cutting training time. Second, we further propose Channel Rearrangement (CR), a method for deep models inspired by the self-clustering. CR attains competitive performance against baselines. Finally, we also discuss whether it is best to forecast the future values using the historical values of the same channel as inputs. We hope our findings and methods could inspire new solutions beyond CD/CI.  ( 2 min )
    The Transient Nature of Emergent In-Context Learning in Transformers. (arXiv:2311.08360v2 [cs.LG] UPDATED)
    Transformer neural networks can exhibit a surprising capacity for in-context learning (ICL) despite not being explicitly trained for it. Prior work has provided a deeper understanding of how ICL emerges in transformers, e.g. through the lens of mechanistic interpretability, Bayesian inference, or by examining the distributional properties of training data. However, in each of these cases, ICL is treated largely as a persistent phenomenon; namely, once ICL emerges, it is assumed to persist asymptotically. Here, we show that the emergence of ICL during transformer training is, in fact, often transient. We train transformers on synthetic data designed so that both ICL and in-weights learning (IWL) strategies can lead to correct predictions. We find that ICL first emerges, then disappears and gives way to IWL, all while the training loss decreases, indicating an asymptotic preference for IWL. The transient nature of ICL is observed in transformers across a range of model sizes and datasets, raising the question of how much to "overtrain" transformers when seeking compact, cheaper-to-run models. We find that L2 regularization may offer a path to more persistent ICL that removes the need for early stopping based on ICL-style validation tasks. Finally, we present initial evidence that ICL transience may be caused by competition between ICL and IWL circuits.  ( 3 min )
    RLtools: A Fast, Portable Deep Reinforcement Learning Library for Continuous Control. (arXiv:2306.03530v2 [cs.LG] UPDATED)
    Deep Reinforcement Learning (RL) has been demonstrated to yield capable agents and control policies in several domains but is commonly plagued by prohibitively long training times. Additionally, in the case of continuous control problems, the applicability of learned policies on real-world embedded devices is limited due to the lack of real-time guarantees and portability of existing deep learning libraries. To address these challenges, we present RLtools, a dependency-free, header-only, pure C++ library for deep supervised and reinforcement learning. Leveraging the template meta-programming capabilities of recent C++ standards, we provide composable components that can be tightly integrated by the compiler. Its novel architecture allows RLtools to be used seamlessly on a heterogeneous set of platforms, from HPC clusters over workstations and laptops to smartphones, smartwatches, and microcontrollers. Specifically, due to the tight integration of the RL algorithms with simulation environments, RLtools can solve popular RL problems like the Pendulum-v1 swing-up about 7 to 15 times faster in terms of wall-clock training time compared to other popular RL frameworks when using TD3. We also provide a low-overhead and parallelized interface to the MuJoCo simulator, showing that our PPO implementation achieves state of the art returns in the Ant-v4 environment while being 25%-30% faster in terms of wall-clock training time. Finally, we also benchmark the policy inference on a diverse set of microcontrollers and show that in most cases our optimized inference implementation is much faster than even the manufacturer's DSP libraries. To the best of our knowledge, RLtools enables the first-ever demonstration of training a deep RL algorithm directly on a microcontroller, giving rise to the field of TinyRL. The source code is available through our project page at https://rl.tools.  ( 3 min )
    Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models. (arXiv:2311.08692v1 [cs.CL])
    The complementary potential of Large Language Models (LLM) assumes off-the-shelf LLMs have heterogeneous expertise in a wide range of domains and tasks so that an ensemble of LLMs can achieve consistently better performance. Existing ensemble methods for LLMs mainly focus on reward model ranking of outputs, leading to significant computation overhead. To combat this issue, we revisit the complementary potential of LLMs and further elaborate it by mining latent expertise with off-the-shelf reward models. We propose Zooter, a reward-guided routing method distilling rewards on training queries to train a routing function, which can precisely distribute each query to the LLM with expertise about it. We also integrate a tag-based label enhancement to mitigate noise from uncertainty when using rewards as silver supervision. Zooter shows computation efficiency in inference as it introduces only a minor computation overhead of a routing function compared with reward model ranking methods. We evaluate Zooter on a comprehensive benchmark collection with 26 subsets on different domains and tasks. Zooter outperforms the best single model on average and ranks first on 44% of tasks, even surpassing multiple reward model ranking methods.  ( 2 min )
    Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback. (arXiv:2310.14085v3 [cs.GT] UPDATED)
    Online gradient descent (OGD) is well known to be doubly optimal under strong convexity or monotonicity assumptions: (1) in the single-agent setting, it achieves an optimal regret of $\Theta(\log T)$ for strongly convex cost functions; and (2) in the multi-agent setting of strongly monotone games, with each agent employing OGD, we obtain last-iterate convergence of the joint action to a unique Nash equilibrium at an optimal rate of $\Theta(\frac{1}{T})$. While these finite-time guarantees highlight its merits, OGD has the drawback that it requires knowing the strong convexity/monotonicity parameters. In this paper, we design a fully adaptive OGD algorithm, \textsf{AdaOGD}, that does not require a priori knowledge of these parameters. In the single-agent setting, our algorithm achieves $O(\log^2(T))$ regret under strong convexity, which is optimal up to a log factor. Further, if each agent employs \textsf{AdaOGD} in strongly monotone games, the joint action converges in a last-iterate sense to a unique Nash equilibrium at a rate of $O(\frac{\log^3 T}{T})$, again optimal up to log factors. We illustrate our algorithms in a learning version of the classical newsvendor problem, where due to lost sales, only (noisy) gradient feedback can be observed. Our results immediately yield the first feasible and near-optimal algorithm for both the single-retailer and multi-retailer settings. We also extend our results to the more general setting of exp-concave cost functions and games, using the online Newton step (ONS) algorithm.  ( 3 min )
    Analyzing Inexact Hypergradients for Bilevel Learning. (arXiv:2301.04764v2 [math.OC] UPDATED)
    Estimating hyperparameters has been a long-standing problem in machine learning. We consider the case where the task at hand is modeled as the solution to an optimization problem. Here the exact gradient with respect to the hyperparameters cannot be feasibly computed and approximate strategies are required. We introduce a unified framework for computing hypergradients that generalizes existing methods based on the implicit function theorem and automatic differentiation/backpropagation, showing that these two seemingly disparate approaches are actually tightly connected. Our framework is extremely flexible, allowing its subproblems to be solved with any suitable method, to any degree of accuracy. We derive a priori and computable a posteriori error bounds for all our methods, and numerically show that our a posteriori bounds are usually more accurate. Our numerical results also show that, surprisingly, for efficient bilevel optimization, the choice of hypergradient algorithm is at least as important as the choice of lower-level solver.  ( 2 min )
    Attention for Causal Relationship Discovery from Biological Neural Dynamics. (arXiv:2311.06928v2 [cs.LG] UPDATED)
    This paper explores the potential of the transformer models for learning Granger causality in networks with complex nonlinear dynamics at every node, as in neurobiological and biophysical networks. Our study primarily focuses on a proof-of-concept investigation based on simulated neural dynamics, for which the ground-truth causality is known through the underlying connectivity matrix. For transformer models trained to forecast neuronal population dynamics, we show that the cross attention module effectively captures the causal relationship among neurons, with an accuracy equal or superior to that for the most popular Granger causality analysis method. While we acknowledge that real-world neurobiology data will bring further challenges, including dynamic connectivity and unobserved variability, this research offers an encouraging preliminary glimpse into the utility of the transformer model for causal representation learning in neuroscience.  ( 2 min )
    MiniZero: Comparative Analysis of AlphaZero and MuZero on Go, Othello, and Atari Games. (arXiv:2310.11305v2 [cs.AI] UPDATED)
    This paper presents MiniZero, a zero-knowledge learning framework that supports four state-of-the-art algorithms, including AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero. While these algorithms have demonstrated super-human performance in many games, it remains unclear which among them is most suitable or efficient for specific tasks. Through MiniZero, we systematically evaluate the performance of each algorithm in two board games, 9x9 Go and 8x8 Othello, as well as 57 Atari games. For two board games, using more simulations generally results in higher performance. However, the choice of AlphaZero and MuZero may differ based on game properties. For Atari games, both MuZero and Gumbel MuZero are worth considering. Since each game has unique characteristics, different algorithms and simulations yield varying results. In addition, we introduce an approach, called progressive simulation, which progressively increases the simulation budget during training to allocate computation more efficiently. Our empirical results demonstrate that progressive simulation achieves significantly superior performance in two board games. By making our framework and trained models publicly available, this paper contributes a benchmark for future research on zero-knowledge learning algorithms, assisting researchers in algorithm selection and comparison against these zero-knowledge learning baselines. Our code and data are available at https://rlg.iis.sinica.edu.tw/papers/minizero.  ( 3 min )
    Latent SDEs on Homogeneous Spaces. (arXiv:2306.16248v2 [cs.LG] UPDATED)
    We consider the problem of variational Bayesian inference in a latent variable model where a (possibly complex) observed stochastic process is governed by the solution of a latent stochastic differential equation (SDE). Motivated by the challenges that arise when trying to learn an (almost arbitrary) latent neural SDE from large-scale data, such as efficient gradient computation, we take a step back and study a specific subclass instead. In our case, the SDE evolves on a homogeneous latent space and is induced by stochastic dynamics of the corresponding (matrix) Lie group. In learning problems, SDEs on the unit $n$-sphere are arguably the most relevant incarnation of this setup. Notably, for variational inference, the sphere not only facilitates using a truly uninformative prior SDE, but we also obtain a particularly simple and intuitive expression for the Kullback-Leibler divergence between the approximate posterior and prior process in the evidence lower bound. Experiments demonstrate that a latent SDE of the proposed type can be learned efficiently by means of an existing one-step geometric Euler-Maruyama scheme. Despite restricting ourselves to a less diverse class of SDEs, we achieve competitive or even state-of-the-art performance on various time series interpolation and classification benchmarks.  ( 2 min )
    Sample Dominance Aware Framework via Non-Parametric Estimation for Spontaneous Brain-Computer Interface. (arXiv:2311.07079v2 [cs.LG] UPDATED)
    Deep learning has shown promise in decoding brain signals, such as electroencephalogram (EEG), in the field of brain-computer interfaces (BCIs). However, the non-stationary characteristics of EEG signals pose challenges for training neural networks to acquire appropriate knowledge. Inconsistent EEG signals resulting from these non-stationary characteristics can lead to poor performance. Therefore, it is crucial to investigate and address sample inconsistency to ensure robust performance in spontaneous BCIs. In this study, we introduce the concept of sample dominance as a measure of EEG signal inconsistency and propose a method to modulate its effect on network training. We present a two-stage dominance score estimation technique that compensates for performance degradation caused by sample inconsistencies. Our proposed method utilizes non-parametric estimation to infer sample inconsistency and assigns each sample a dominance score. This score is then aggregated with the loss function during training to modulate the impact of sample inconsistency. Furthermore, we design a curriculum learning approach that gradually increases the influence of inconsistent signals during training to improve overall performance. We evaluate our proposed method using public spontaneous BCI dataset. The experimental results confirm that our findings highlight the importance of addressing sample dominance for achieving robust performance in spontaneous BCIs.  ( 2 min )
    Are Deep Learning Classification Results Obtained on CT Scans Fair and Interpretable?. (arXiv:2309.12632v2 [cs.LG] UPDATED)
    Following the great success of various deep learning methods in image and object classification, the biomedical image processing society is also overwhelmed with their applications to various automatic diagnosis cases. Unfortunately, most of the deep learning-based classification attempts in the literature solely focus on the aim of extreme accuracy scores, without considering interpretability, or patient-wise separation of training and test data. For example, most lung nodule classification papers using deep learning randomly shuffle data and split it into training, validation, and test sets, causing certain images from the CT scan of a person to be in the training set, while other images of the exact same person to be in the validation or testing image sets. This can result in reporting misleading accuracy rates and the learning of irrelevant features, ultimately reducing the real-life usability of these models. When the deep neural networks trained on the traditional, unfair data shuffling method are challenged with new patient images, it is observed that the trained models perform poorly. In contrast, deep neural networks trained with strict patient-level separation maintain their accuracy rates even when new patient images are tested. Heat-map visualizations of the activations of the deep neural networks trained with strict patient-level separation indicate a higher degree of focus on the relevant nodules. We argue that the research question posed in the title has a positive answer only if the deep neural networks are trained with images of patients that are strictly isolated from the validation and testing patient sets.  ( 3 min )
    Arbitrariness and Prediction: The Confounding Role of Variance in Fair Classification. (arXiv:2301.11562v6 [cs.LG] UPDATED)
    Variance in predictions across different trained models is a significant, under-explored source of error in fair binary classification. In practice, the variance on some data examples is so large that decisions can be effectively arbitrary. To investigate this problem, we take an experimental approach and make four overarching contributions: We: 1) Define a metric called self-consistency, derived from variance, which we use as a proxy for measuring and reducing arbitrariness; 2) Develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary; 3) Conduct the largest to-date empirical study of the role of variance (vis-a-vis self-consistency and arbitrariness) in fair binary classification; and, 4) Release a toolkit that makes the US Home Mortgage Disclosure Act (HMDA) datasets easily usable for future research. Altogether, our experiments reveal shocking insights about the reliability of conclusions on benchmark datasets. Most fair binary classification benchmarks are close-to-fair when taking into account the amount of arbitrariness present in predictions -- before we even try to apply any fairness interventions. This finding calls into question the practical utility of common algorithmic fairness methods, and in turn suggests that we should reconsider how we choose to measure fairness in binary classification.  ( 3 min )
    ManimML: Communicating Machine Learning Architectures with Animation. (arXiv:2306.17108v3 [cs.LG] UPDATED)
    There has been an explosion in interest in machine learning (ML) in recent years due to its applications to science and engineering. However, as ML techniques have advanced, tools for explaining and visualizing novel ML algorithms have lagged behind. Animation has been shown to be a powerful tool for making engaging visualizations of systems that dynamically change over time, which makes it well suited to the task of communicating ML algorithms. However, the current approach to animating ML algorithms is to handcraft applications that highlight specific algorithms or use complex generalized animation software. We developed ManimML, an open-source Python library for easily generating animations of ML algorithms directly from code. We sought to leverage ML practitioners' preexisting knowledge of programming rather than requiring them to learn complex animation software. ManimML has a familiar syntax for specifying neural networks that mimics popular deep learning frameworks like Pytorch. A user can take a preexisting neural network architecture and easily write a specification for an animation in ManimML, which will then automatically compose animations for different components of the system into a final animation of the entire neural network. ManimML is open source and available at https://github.com/helblazer811/ManimML.  ( 2 min )
    PriorBand: Practical Hyperparameter Optimization in the Age of Deep Learning. (arXiv:2306.12370v2 [cs.LG] UPDATED)
    Hyperparameters of Deep Learning (DL) pipelines are crucial for their downstream performance. While a large number of methods for Hyperparameter Optimization (HPO) have been developed, their incurred costs are often untenable for modern DL. Consequently, manual experimentation is still the most prevalent approach to optimize hyperparameters, relying on the researcher's intuition, domain knowledge, and cheap preliminary explorations. To resolve this misalignment between HPO algorithms and DL researchers, we propose PriorBand, an HPO algorithm tailored to DL, able to utilize both expert beliefs and cheap proxy tasks. Empirically, we demonstrate PriorBand's efficiency across a range of DL benchmarks and show its gains under informative expert input and robustness against poor expert beliefs  ( 2 min )
    Adversarial Attacks to Reward Machine-based Reinforcement Learning. (arXiv:2311.09014v1 [cs.LG])
    In recent years, Reward Machines (RMs) have stood out as a simple yet effective automata-based formalism for exposing and exploiting task structure in reinforcement learning settings. Despite their relevance, little to no attention has been directed to the study of their security implications and robustness to adversarial scenarios, likely due to their recent appearance in the literature. With my thesis, I aim to provide the first analysis of the security of RM-based reinforcement learning techniques, with the hope of motivating further research in the field, and I propose and evaluate a novel class of attacks on RM-based techniques: blinding attacks.  ( 2 min )
    Towards a Transportable Causal Network Model Based on Observational Healthcare Data. (arXiv:2311.08427v1 [cs.LG])
    Over the last decades, many prognostic models based on artificial intelligence techniques have been used to provide detailed predictions in healthcare. Unfortunately, the real-world observational data used to train and validate these models are almost always affected by biases that can strongly impact the outcomes validity: two examples are values missing not-at-random and selection bias. Addressing them is a key element in achieving transportability and in studying the causal relationships that are critical in clinical decision making, going beyond simpler statistical approaches based on probabilistic association. In this context, we propose a novel approach that combines selection diagrams, missingness graphs, causal discovery and prior knowledge into a single graphical model to estimate the cardiovascular risk of adolescent and young females who survived breast cancer. We learn this model from data comprising two different cohorts of patients. The resulting causal network model is validated by expert clinicians in terms of risk assessment, accuracy and explainability, and provides a prognostic model that outperforms competing machine learning methods.  ( 2 min )
    Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation. (arXiv:2311.00684v2 [cs.CL] UPDATED)
    An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning. Such long-context utilization capability relies heavily on a flexible positional embedding design. Upon investigating the flexibility of existing large pre-trained Transformer language models, we find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns. However, T5 suffers from the dispersed attention issue: the longer the input sequence, the flatter the attention distribution. To alleviate the issue, we propose two attention alignment strategies via temperature scaling. Our findings show improvement on the long-context utilization capability of T5 on language modeling, retrieval, multi-document question answering, and code completion tasks without any fine-tuning. This suggests that a flexible positional embedding design and attention alignment can go a long way toward Transformer length extrapolation.  ( 2 min )
    Scheming AIs: Will AIs fake alignment during training in order to get power?. (arXiv:2311.08379v2 [cs.CY] UPDATED)
    This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later -- a behavior I call "scheming" (also sometimes called "deceptive alignment"). I conclude that scheming is a disturbingly plausible outcome of using baseline machine learning methods to train goal-directed AIs sophisticated enough to scheme (my subjective probability on such an outcome, given these conditions, is roughly 25%). In particular: if performing well in training is a good strategy for gaining power (as I think it might well be), then a very wide variety of goals would motivate scheming -- and hence, good training performance. This makes it plausible that training might either land on such a goal naturally and then reinforce it, or actively push a model's motivations towards such a goal as an easy way of improving performance. What's more, because schemers pretend to be aligned on tests designed to reveal their motivations, it may be quite difficult to tell whether this has occurred. However, I also think there are reasons for comfort. In particular: scheming may not actually be such a good strategy for gaining power; various selection pressures in training might work against schemer-like goals (for example, relative to non-schemers, schemers need to engage in extra instrumental reasoning, which might harm their training performance); and we may be able to increase such pressures intentionally. The report discusses these and a wide variety of other considerations in detail, and it suggests an array of empirical research directions for probing the topic further.  ( 3 min )
    New Horizons in Parameter Regularization: A Constraint Approach. (arXiv:2311.09058v1 [cs.LG])
    This work presents constrained parameter regularization (CPR), an alternative to traditional weight decay. Instead of applying a constant penalty uniformly to all parameters, we enforce an upper bound on a statistical measure (e.g., the L$_2$-norm) of individual parameter groups. This reformulates learning as a constrained optimization problem. To solve this, we utilize an adaptation of the augmented Lagrangian method. Our approach allows for varying regularization strengths across different parameter groups, removing the need for explicit penalty coefficients in the regularization terms. CPR only requires two hyperparameters and introduces no measurable runtime overhead. We offer empirical evidence of CPR's effectiveness through experiments in the "grokking" phenomenon, image classification, and language modeling. Our findings show that CPR can counteract the effects of grokking, and it consistently matches or surpasses the performance of traditional weight decay.
    Converting Transformers to Polynomial Form for Secure Inference Over Homomorphic Encryption. (arXiv:2311.08610v1 [cs.LG])
    Designing privacy-preserving deep learning models is a major challenge within the deep learning community. Homomorphic Encryption (HE) has emerged as one of the most promising approaches in this realm, enabling the decoupling of knowledge between the model owner and the data owner. Despite extensive research and application of this technology, primarily in convolutional neural networks, incorporating HE into transformer models has been challenging because of the difficulties in converting these models into a polynomial form. We break new ground by introducing the first polynomial transformer, providing the first demonstration of secure inference over HE with transformers. This includes a transformer architecture tailored for HE, alongside a novel method for converting operators to their polynomial equivalent. This innovation enables us to perform secure inference on LMs with WikiText-103. It also allows us to perform image classification with CIFAR-100 and Tiny-ImageNet. Our models yield results comparable to traditional methods, bridging the performance gap with transformers of similar scale and underscoring the viability of HE for state-of-the-art applications. Finally, we assess the stability of our models and conduct a series of ablations to quantify the contribution of each model component.
    Graph-based Neural Weather Prediction for Limited Area Modeling. (arXiv:2309.17370v2 [cs.LG] UPDATED)
    The rise of accurate machine learning methods for weather forecasting is creating radical new possibilities for modeling the atmosphere. In the time of climate change, having access to high-resolution forecasts from models like these is also becoming increasingly vital. While most existing Neural Weather Prediction (NeurWP) methods focus on global forecasting, an important question is how these techniques can be applied to limited area modeling. In this work we adapt the graph-based NeurWP approach to the limited area setting and propose a multi-scale hierarchical model extension. Our approach is validated by experiments with a local model for the Nordic region.
    CFARnet: deep learning for target detection with constant false alarm rate. (arXiv:2208.02474v3 [cs.LG] UPDATED)
    We consider the problem of target detection with a constant false alarm rate (CFAR). This constraint is crucial in many practical applications and is a standard requirement in classical composite hypothesis testing. In settings where classical approaches are computationally expensive or where only data samples are given, machine learning methodologies are advantageous. CFAR is less understood in these settings. To close this gap, we introduce a framework of CFAR constrained detectors. Theoretically, we prove that a CFAR constrained Bayes optimal detector is asymptotically equivalent to the classical generalized likelihood ratio test (GLRT). Practically, we develop a deep learning framework for fitting neural networks that approximate it. Experiments of target detection in different setting demonstrate that the proposed CFARnet allows a flexible tradeoff between CFAR and accuracy.
    On Wasted Contributions: Understanding the Dynamics of Contributor-Abandoned Pull Requests. (arXiv:2110.15447v2 [cs.SE] CROSS LISTED)
    Pull-based development has enabled numerous volunteers to contribute to open-source projects with fewer barriers. Nevertheless, a considerable amount of pull requests (PRs) with valid contributions are abandoned by their contributors, wasting the effort and time put in by both the contributors and maintainers. To better understand the underlying dynamics of contributor-abandoned PRs, we conduct a mixed-methods study using both quantitative and qualitative methods. We curate a dataset consisting of 265,325 PRs including 4,450 abandoned ones from ten popular and mature GitHub projects and measure 16 features characterizing PRs, contributors, review processes, and projects. Using statistical and machine learning techniques, we find that complex PRs, novice contributors, and lengthy reviews have a higher probability of abandonment and the rate of PR abandonment fluctuates alongside the projects' maturity or workload. To identify why contributors abandon their PRs, we also manually examine a random sample of 354 abandoned PRs. We observe that the most frequent abandonment reasons are related to the obstacles faced by contributors, followed by the hurdles imposed by maintainers during the review process. Finally, we survey the top core maintainers of the studied projects to understand their perspectives on dealing with PR abandonment and on our findings.
    Explaining black box text modules in natural language with language models. (arXiv:2305.09863v2 [cs.AI] UPDATED)
    Large language models (LLMs) have demonstrated remarkable prediction performance for a growing array of tasks. However, their rapid proliferation and increasing opaqueness have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A "text module" is any function that maps text to a scalar continuous value, such as a submodule within an LLM or a fitted model of a brain region. "Black box" indicates that we only have access to the module's inputs/outputs. We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity along with a score for how reliable the explanation is. We study SASC in 3 contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals. Finally, we show that SASC can generate explanations for the response of individual fMRI voxels to language stimuli, with potential applications to fine-grained brain mapping. All code for using SASC and reproducing results is made available on Github.
    Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems. (arXiv:2305.12102v3 [cs.LG] UPDATED)
    Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems. A typical model ingests hundreds of features with vocabularies on the order of millions to billions of tokens. The standard approach is to represent each feature value as a d-dimensional embedding, introducing hundreds of billions of parameters for extremely high-cardinality features. This bottleneck has led to substantial progress in alternative embedding algorithms. Many of these methods, however, make the assumption that each feature uses an independent embedding table. This work introduces a simple yet highly effective framework, Feature Multiplexing, where one single representation space is used across many different categorical features. Our theoretical and empirical analysis reveals that multiplexed embeddings can be decomposed into components from each constituent feature, allowing models to distinguish between features. We show that multiplexed representations lead to Pareto-optimal parameter-accuracy tradeoffs for three public benchmark datasets. Further, we propose a highly practical approach called Unified Embedding with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware. Unified embedding gives significant improvements in offline and online metrics compared to highly competitive baselines across five web-scale search, ads, and recommender systems, where it serves billions of users across the world in industry-leading products.
    SceneScore: Learning a Cost Function for Object Arrangement. (arXiv:2311.08530v1 [cs.RO])
    Arranging objects correctly is a key capability for robots which unlocks a wide range of useful tasks. A prerequisite for creating successful arrangements is the ability to evaluate the desirability of a given arrangement. Our method "SceneScore" learns a cost function for arrangements, such that desirable, human-like arrangements have a low cost. We learn the distribution of training arrangements offline using an energy-based model, solely from example images without requiring environment interaction or human supervision. Our model is represented by a graph neural network which learns object-object relations, using graphs constructed from images. Experiments demonstrate that the learned cost function can be used to predict poses for missing objects, generalise to novel objects using semantic features, and can be composed with other cost functions to satisfy constraints at inference time.
    Towards Graph-Aware Diffusion Modeling for Collaborative Filtering. (arXiv:2311.08744v1 [cs.IR])
    Recovering masked feedback with neural models is a popular paradigm in recommender systems. Seeing the success of diffusion models in solving ill-posed inverse problems, we introduce a conditional diffusion framework for collaborative filtering that iteratively reconstructs a user's hidden preferences guided by its historical interactions. To better align with the intrinsic characteristics of implicit feedback data, we implement forward diffusion by applying synthetic smoothing filters to interaction signals on an item-item graph. The resulting reverse diffusion can be interpreted as a personalized process that gradually refines preference scores. Through graph Fourier transform, we equivalently characterize this model as an anisotropic Gaussian diffusion in the graph spectral domain, establishing both forward and reverse formulations. Our model outperforms state-of-the-art methods by a large margin on one dataset and yields competitive results on the others.
    Machine Translation for Nko: Tools, Corpora and Baseline Results. (arXiv:2310.15612v3 [cs.CL] UPDATED)
    Currently, there is no usable machine translation system for Nko, a language spoken by tens of millions of people across multiple West African countries, which holds significant cultural and educational value. To address this issue, we present a set of tools, resources, and baseline results aimed towards the development of usable machine translation systems for Nko and other languages that do not currently have sufficiently large parallel text corpora available. (1) Fria$\parallel$el: A novel collaborative parallel text curation software that incorporates quality control through copyedit-based workflows. (2) Expansion of the FLoRes-200 and NLLB-Seed corpora with 2,009 and 6,193 high-quality Nko translations in parallel with 204 and 40 other languages. (3) nicolingua-0005: A collection of trilingual and bilingual corpora with 130,850 parallel segments and monolingual corpora containing over 3 million Nko words. (4) Baseline bilingual and multilingual neural machine translation results with the best model scoring 30.83 English-Nko chrF++ on FLoRes-devtest.
    Artificial Intelligence for Science in Quantum, Atomistic, and Continuum Systems. (arXiv:2307.08423v2 [cs.LG] UPDATED)
    Advances in artificial intelligence (AI) are fueling a new paradigm of discoveries in natural sciences. Today, AI has started to advance natural sciences by improving, accelerating, and enabling our understanding of natural phenomena at a wide range of spatial and temporal scales, giving rise to a new area of research known as AI for science (AI4Science). Being an emerging research paradigm, AI4Science is unique in that it is an enormous and highly interdisciplinary area. Thus, a unified and technical treatment of this field is needed yet challenging. This work aims to provide a technically thorough account of a subarea of AI4Science; namely, AI for quantum, atomistic, and continuum systems. These areas aim at understanding the physical world from the subatomic (wavefunctions and electron density), atomic (molecules, proteins, materials, and interactions), to macro (fluids, climate, and subsurface) scales and form an important subarea of AI4Science. A unique advantage of focusing on these areas is that they largely share a common set of challenges, thereby allowing a unified and foundational treatment. A key common challenge is how to capture physics first principles, especially symmetries, in natural systems by deep learning methods. We provide an in-depth yet intuitive account of techniques to achieve equivariance to symmetry transformations. We also discuss other common technical challenges, including explainability, out-of-distribution generalization, knowledge transfer with foundation and large language models, and uncertainty quantification. To facilitate learning and education, we provide categorized lists of resources that we found to be useful. We strive to be thorough and unified and hope this initial effort may trigger more community interests and efforts to further advance AI4Science.
    Counterfactual Explanation via Search in Gaussian Mixture Distributed Latent Space. (arXiv:2307.13390v2 [cs.LG] UPDATED)
    Counterfactual Explanations (CEs) are an important tool in Algorithmic Recourse for addressing two questions: 1. What are the crucial factors that led to an automated prediction/decision? 2. How can these factors be changed to achieve a more favorable outcome from a user's perspective? Thus, guiding the user's interaction with AI systems by proposing easy-to-understand explanations and easy-to-attain feasible changes is essential for the trustworthy adoption and long-term acceptance of AI systems. In the literature, various methods have been proposed to generate CEs, and different quality measures have been suggested to evaluate these methods. However, the generation of CEs is usually computationally expensive, and the resulting suggestions are unrealistic and thus non-actionable. In this paper, we introduce a new method to generate CEs for a pre-trained binary classifier by first shaping the latent space of an autoencoder to be a mixture of Gaussian distributions. CEs are then generated in latent space by linear interpolation between the query sample and the centroid of the target class. We show that our method maintains the characteristics of the input sample during the counterfactual search. In various experiments, we show that the proposed method is competitive based on different quality measures on image and tabular datasets -- efficiently returns results that are closer to the original data manifold compared to three state-of-the-art methods, which are essential for realistic high-dimensional machine learning applications.
    Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing. (arXiv:2301.12554v3 [cs.LG] UPDATED)
    While prior research has proposed a plethora of methods that build neural classifiers robust against adversarial robustness, practitioners are still reluctant to adopt them due to their unacceptably severe clean accuracy penalties. This paper significantly alleviates this accuracy-robustness trade-off by mixing the output probabilities of a standard classifier and a robust classifier, where the standard network is optimized for clean accuracy and is not robust in general. We show that the robust base classifier's confidence difference for correct and incorrect examples is the key to this improvement. In addition to providing intuitions and empirical evidence, we theoretically certify the robustness of the mixed classifier under realistic assumptions. Furthermore, we adapt an adversarial input detector into a mixing network that adaptively adjusts the mixture of the two base models, further reducing the accuracy penalty of achieving robustness. The proposed flexible method, termed "adaptive smoothing", can work in conjunction with existing or even future methods that improve clean accuracy, robustness, or adversary detection. Our empirical evaluation considers strong attack methods, including AutoAttack and adaptive attack. On the CIFAR-100 dataset, our method achieves an 85.21% clean accuracy while maintaining a 38.72% $\ell_\infty$-AutoAttacked ($\epsilon = 8/255$) accuracy, becoming the second most robust method on the RobustBench CIFAR-100 benchmark as of submission, while improving the clean accuracy by ten percentage points compared with all listed models. The code that implements our method is available at https://github.com/Bai-YT/AdaptiveSmoothing.
    Machine-learning parameter tracking with partial state observation. (arXiv:2311.09142v1 [cs.LG])
    Complex and nonlinear dynamical systems often involve parameters that change with time, accurate tracking of which is essential to tasks such as state estimation, prediction, and control. Existing machine-learning methods require full state observation of the underlying system and tacitly assume adiabatic changes in the parameter. Formulating an inverse problem and exploiting reservoir computing, we develop a model-free and fully data-driven framework to accurately track time-varying parameters from partial state observation in real time. In particular, with training data from a subset of the dynamical variables of the system for a small number of known parameter values, the framework is able to accurately predict the parameter variations in time. Low- and high-dimensional, Markovian and non-Markovian nonlinear dynamical systems are used to demonstrate the power of the machine-learning based parameter-tracking framework. Pertinent issues affecting the tracking performance are addressed.
    Flexible numerical optimization with ensmallen. (arXiv:2003.04103v4 [cs.MS] UPDATED)
    This report provides an introduction to the ensmallen numerical optimization library, as well as a deep dive into the technical details of how it works. The library provides a fast and flexible C++ framework for mathematical optimization of arbitrary user-supplied functions. A large set of pre-built optimizers is provided, including many variants of Stochastic Gradient Descent and Quasi-Newton optimizers. Several types of objective functions are supported, including differentiable, separable, constrained, and categorical objective functions. Implementation of a new optimizer requires only one method, while a new objective function requires typically only one or two C++ methods. Through internal use of C++ template metaprogramming, ensmallen provides support for arbitrary user-supplied callbacks and automatic inference of unsupplied methods without any runtime overhead. Empirical comparisons show that ensmallen outperforms other optimization frameworks (such as Julia and SciPy), sometimes by large margins. The library is available at https://ensmallen.org and is distributed under the permissive BSD license.
    A Single-Loop Algorithm for Decentralized Bilevel Optimization. (arXiv:2311.08945v1 [math.OC])
    Bilevel optimization has received more and more attention recently due to its wide applications in machine learning. In this paper, we consider bilevel optimization in decentralized networks. In particular, we propose a novel single-loop algorithm for solving decentralized bilevel optimization with strongly convex lower level problem. Our algorithm is fully single-loop and does not require heavy matrix-vector multiplications when approximating the hypergradient. Moreover, unlike existing methods for decentralized bilevel optimization and federated bilevel optimization, our algorithm does not require any gradient heterogeneity assumption. Our analysis shows that the proposed algorithm achieves the best known convergence rate for bilevel optimization algorithms.
    Probabilistic Control and Majorization of Optimal Control. (arXiv:2205.03279v5 [cs.LG] UPDATED)
    Probabilistic control design is founded on the principle that a rational agent attempts to match modelled with an arbitrary desired closed-loop system trajectory density. The framework was originally proposed as a tractable alternative to traditional optimal control design, parametrizing desired behaviour through fictitious transition and policy densities and using the information projection as a proximity measure. In this work we introduce an alternative parametrization of desired closed-loop behaviour and explore alternative proximity measures between densities. It is then illustrated how the associated probabilistic control problems solve into uncertain or probabilistic policies. Our main result is to show that the probabilistic control objectives majorize conventional, stochastic and risk sensitive, optimal control objectives. This observation allows us to identify two probabilistic fixed point iterations that converge to the deterministic optimal control policies establishing an explicit connection between either formulations. Further we demonstrate that the risk sensitive optimal control formulation is also technically equivalent to a Maximum Likelihood estimation problem on a probabilistic graph model where the notion of costs is directly encoded into the model. The associated treatment of the estimation problem is then shown to coincide with the moment projected probabilistic control formulation. That way optimal decision making can be reformulated as an iterative inference problem. Based on these insights we discuss directions for algorithmic development.
    On the Importance of Step-wise Embeddings for Heterogeneous Clinical Time-Series. (arXiv:2311.08902v1 [cs.LG])
    Recent advances in deep learning architectures for sequence modeling have not fully transferred to tasks handling time-series from electronic health records. In particular, in problems related to the Intensive Care Unit (ICU), the state-of-the-art remains to tackle sequence classification in a tabular manner with tree-based methods. Recent findings in deep learning for tabular data are now surpassing these classical methods by better handling the severe heterogeneity of data input features. Given the similar level of feature heterogeneity exhibited by ICU time-series and motivated by these findings, we explore these novel methods' impact on clinical sequence modeling tasks. By jointly using such advances in deep learning for tabular data, our primary objective is to underscore the importance of step-wise embeddings in time-series modeling, which remain unexplored in machine learning methods for clinical data. On a variety of clinically relevant tasks from two large-scale ICU datasets, MIMIC-III and HiRID, our work provides an exhaustive analysis of state-of-the-art methods for tabular time-series as time-step embedding models, showing overall performance improvement. In particular, we evidence the importance of feature grouping in clinical time-series, with significant performance gains when considering features within predefined semantic groups in the step-wise embedding module.
    Causal prediction models for medication safety monitoring: The diagnosis of vancomycin-induced acute kidney injury. (arXiv:2311.09137v1 [cs.LG])
    The current best practice approach for the retrospective diagnosis of adverse drug events (ADEs) in hospitalized patients relies on a full patient chart review and a formal causality assessment by multiple medical experts. This evaluation serves to qualitatively estimate the probability of causation (PC); the probability that a drug was a necessary cause of an adverse event. This practice is manual, resource intensive and prone to human biases, and may thus benefit from data-driven decision support. Here, we pioneer a causal modeling approach using observational data to estimate a lower bound of the PC (PC$_{low}$). This method includes two key causal inference components: (1) the target trial emulation framework and (2) estimation of individualized treatment effects using machine learning. We apply our method to the clinically relevant use-case of vancomycin-induced acute kidney injury in intensive care patients, and compare our causal model-based PC$_{low}$ estimates to qualitative estimates of the PC provided by a medical expert. Important limitations and potential improvements are discussed, and we conclude that future improved causal models could provide essential data-driven support for medication safety monitoring in hospitalized patients.
    Error Bounds for Learning with Vector-Valued Random Features. (arXiv:2305.17170v2 [stat.ML] UPDATED)
    This paper provides a comprehensive error analysis of learning with vector-valued random features (RF). The theory is developed for RF ridge regression in a fully general infinite-dimensional input-output setting, but nonetheless applies to and improves existing finite-dimensional analyses. In contrast to comparable work in the literature, the approach proposed here relies on a direct analysis of the underlying risk functional and completely avoids the explicit RF ridge regression solution formula in terms of random matrices. This removes the need for concentration results in random matrix theory or their generalizations to random operators. The main results established in this paper include strong consistency of vector-valued RF estimators under model misspecification and minimax optimal convergence rates in the well-specified setting. The parameter complexity (number of random features) and sample complexity (number of labeled data) required to achieve such rates are comparable with Monte Carlo intuition and free from logarithmic factors.
    On the Foundation of Distributionally Robust Reinforcement Learning. (arXiv:2311.09018v1 [cs.LG])
    Motivated by the need for a robust policy in the face of environment shifts between training and the deployment, we contribute to the theoretical foundation of distributionally robust reinforcement learning (DRRL). This is accomplished through a comprehensive modeling framework centered around distributionally robust Markov decision processes (DRMDPs). This framework obliges the decision maker to choose an optimal policy under the worst-case distributional shift orchestrated by an adversary. By unifying and extending existing formulations, we rigorously construct DRMDPs that embraces various modeling attributes for both the decision maker and the adversary. These attributes include adaptability granularity, exploring history-dependent, Markov, and Markov time-homogeneous decision maker and adversary dynamics. Additionally, we delve into the flexibility of shifts induced by the adversary, examining SA and S-rectangularity. Within this DRMDP framework, we investigate conditions for the existence or absence of the dynamic programming principle (DPP). From an algorithmic standpoint, the existence of DPP holds significant implications, as the vast majority of existing data and computationally efficiency RL algorithms are reliant on the DPP. To study its existence, we comprehensively examine combinations of controller and adversary attributes, providing streamlined proofs grounded in a unified methodology. We also offer counterexamples for settings in which a DPP with full generality is absent.
    Two-stage Joint Transductive and Inductive learning for Nuclei Segmentation. (arXiv:2311.08774v1 [eess.IV])
    AI-assisted nuclei segmentation in histopathological images is a crucial task in the diagnosis and treatment of cancer diseases. It decreases the time required to manually screen microscopic tissue images and can resolve the conflict between pathologists during diagnosis. Deep Learning has proven useful in such a task. However, lack of labeled data is a significant barrier for deep learning-based approaches. In this study, we propose a novel approach to nuclei segmentation that leverages the available labelled and unlabelled data. The proposed method combines the strengths of both transductive and inductive learning, which have been previously attempted separately, into a single framework. Inductive learning aims at approximating the general function and generalizing to unseen test data, while transductive learning has the potential of leveraging the unlabelled test data to improve the classification. To the best of our knowledge, this is the first study to propose such a hybrid approach for medical image segmentation. Moreover, we propose a novel two-stage transductive inference scheme. We evaluate our approach on MoNuSeg benchmark to demonstrate the efficacy and potential of our method.
    Frequency Domain-based Dataset Distillation. (arXiv:2311.08819v1 [cs.LG])
    This paper presents FreD, a novel parameterization method for dataset distillation, which utilizes the frequency domain to distill a small-sized synthetic dataset from a large-sized original dataset. Unlike conventional approaches that focus on the spatial domain, FreD employs frequency-based transforms to optimize the frequency representations of each data instance. By leveraging the concentration of spatial domain information on specific frequency components, FreD intelligently selects a subset of frequency dimensions for optimization, leading to a significant reduction in the required budget for synthesizing an instance. Through the selection of frequency dimensions based on the explained variance, FreD demonstrates both theoretical and empirical evidence of its ability to operate efficiently within a limited budget, while better preserving the information of the original dataset compared to conventional parameterization methods. Furthermore, based on the orthogonal compatibility of FreD with existing methods, we confirm that FreD consistently improves the performances of existing distillation methods over the evaluation scenarios with different benchmark datasets. We release the code at https://github.com/sdh0818/FreD.
    Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods. (arXiv:2206.01251v2 [cs.LG] UPDATED)
    We address the problem of evaluating the quality of self-supervised learning (SSL) models without access to supervised labels, while being agnostic to the architecture, learning algorithm or data manipulation used during training. We argue that representations can be evaluated through the lens of expressiveness and learnability. We propose to use the Intrinsic Dimension (ID) to assess expressiveness and introduce Cluster Learnability (CL) to assess learnability. CL is measured in terms of the performance of a KNN classifier trained to predict labels obtained by clustering the representations with K-means. We thus combine CL and ID into a single predictor -- CLID. Through a large-scale empirical study with a diverse family of SSL algorithms, we find that CLID better correlates with in-distribution model performance than other competing recent evaluation schemes. We also benchmark CLID on out-of-domain generalization, where CLID serves as a predictor of the transfer performance of SSL models on several visual classification tasks, yielding improvements with respect to the competing baselines.
    Predicting the First Response Latency of Maintainers and Contributors in Pull Requests. (arXiv:2311.07786v1 [cs.SE] CROSS LISTED)
    The success of a Pull Request (PR) depends on the responsiveness of the maintainers and the contributor during the review process. Being aware of the expected waiting times can lead to better interactions and managed expectations for both the maintainers and the contributor. In this paper, we propose a machine-learning approach to predict the first response latency of the maintainers following the submission of a PR, and the first response latency of the contributor after receiving the first response from the maintainers. We curate a dataset of 20 large and popular open-source projects on GitHub and extract 21 features to characterize projects, contributors, PRs, and review processes. Using these features, we then evaluate seven types of classifiers to identify the best-performing models. We also perform permutation feature importance and SHAP analyses to understand the importance and impact of different features on the predicted response latencies. Our best-performing models achieve an average improvement of 33% in AUC-ROC and 58% in AUC-PR for maintainers, as well as 42% in AUC-ROC and 95% in AUC-PR for contributors compared to a no-skilled classifier across the projects. Our findings indicate that PRs submitted earlier in the week, containing an average or slightly above-average number of commits, and with concise descriptions are more likely to receive faster first responses from the maintainers. Similarly, PRs with a lower first response latency from maintainers, that received the first response of maintainers earlier in the week, and containing an average or slightly above-average number of commits tend to receive faster first responses from the contributors. Additionally, contributors with a higher acceptance rate and a history of timely responses in the project are likely to both obtain and provide faster first responses.
    Non-Uniform Smoothness for Gradient Descent. (arXiv:2311.08615v1 [math.OC])
    The analysis of gradient descent-type methods typically relies on the Lipschitz continuity of the objective gradient. This generally requires an expensive hyperparameter tuning process to appropriately calibrate a stepsize for a given problem. In this work we introduce a local first-order smoothness oracle (LFSO) which generalizes the Lipschitz continuous gradients smoothness condition and is applicable to any twice-differentiable function. We show that this oracle can encode all relevant problem information for tuning stepsizes for a suitably modified gradient descent method and give global and local convergence results. We also show that LFSOs in this modified first-order method can yield global linear convergence rates for non-strongly convex problems with extremely flat minima, and thus improve over the lower bound on rates achievable by general (accelerated) first-order methods.
    Extracting Diagnosis Pathways from Electronic Health Records Using Deep Reinforcement Learning. (arXiv:2305.06295v3 [cs.LG] UPDATED)
    Clinical diagnosis guidelines aim at specifying the steps that may lead to a diagnosis. Inspired by guidelines, we aim to learn the optimal sequence of actions to perform in order to obtain a correct diagnosis from electronic health records. We apply various deep reinforcement learning algorithms to this task and experiment on a synthetic but realistic dataset to differentially diagnose anemia and its subtypes and particularly evaluate the robustness of various approaches to noise and missing data. Experimental results show that the deep reinforcement learning algorithms show competitive performance compared to the state-of-the-art methods with the added advantage that they enable the progressive generation of a pathway to the suggested diagnosis, which can both guide and explain the decision process.
    Towards Multi-User Activity Recognition through Facilitated Training Data and Deep Learning for Human-Robot Collaboration Applications. (arXiv:2302.05763v4 [cs.LG] UPDATED)
    Human-robot interaction (HRI) research is progressively addressing multi-party scenarios, where a robot interacts with more than one human user at the same time. Conversely, research is still at an early stage for human-robot collaboration. The use of machine learning techniques to handle such type of collaboration requires data that are less feasible to produce than in a typical HRC setup. This work outlines scenarios of concurrent tasks for non-dyadic HRC applications. Based upon these concepts, this study also proposes an alternative way of gathering data regarding multi-user activity, by collecting data related to single users and merging them in post-processing, to reduce the effort involved in producing recordings of pair settings. To validate this statement, 3D skeleton poses of activity of single users were collected and merged in pairs. After this, such datapoints were used to separately train a long short-term memory (LSTM) network and a variational autoencoder (VAE) composed of spatio-temporal graph convolutional networks (STGCN) to recognise the joint activities of the pairs of people. The results showed that it is possible to make use of data collected in this way for pair HRC settings and get similar performances compared to using training data regarding groups of users recorded under the same settings, relieving from the technical difficulties involved in producing these data. The related code and collected data are publicly available.
    Supervised and Penalized Baseline Correction. (arXiv:2310.18306v2 [stat.ML] UPDATED)
    Spectroscopic measurements can show distorted spectral shapes arising from a mixture of absorbing and scattering contributions. These distortions (or baselines) often manifest themselves as non-constant offsets or low-frequency oscillations. As a result, these baselines can adversely affect analytical and quantitative results. Baseline correction is an umbrella term where one applies pre-processing methods to obtain baseline spectra (the unwanted distortions) and then remove the distortions by differencing. However, current state-of-the art baseline correction methods do not utilize analyte concentrations even if they are available, or even if they contribute significantly to the observed spectral variability. We examine a class of state-of-the-art methods (penalized baseline correction) and modify them such that they can accommodate a priori analyte concentrations such that prediction can be enhanced. Performance will be assessed on two near infra-red data sets across both classical penalized baseline correction methods (without analyte information) and modified penalized baseline correction methods (leveraging analyte information).
    Towards Characterizing Domain Counterfactuals For Invertible Latent Causal Models. (arXiv:2306.11281v2 [cs.LG] UPDATED)
    Answering counterfactual queries has many important applications such as knowledge discovery and explainability, but is challenging when causal variables are unobserved and we only see a projection onto an observation space, for instance, image pixels. One approach is to recover the latent Structural Causal Model (SCM), but this typically needs unrealistic assumptions, such as linearity of the causal mechanisms. Another approach is to use na\"ive ML approximations, such as generative models, to generate counterfactual samples; however, these lack guarantees of accuracy. In this work, we strive to strike a balance between practicality and theoretical guarantees by focusing on a specific type of causal query called domain counterfactuals, which hypothesizes what a sample would have looked like if it had been generated in a different domain (or environment). Concretely, by only assuming invertibility, sparse domain interventions and access to observational data from different domains, we aim to improve domain counterfactual estimation both theoretically and practically with less restrictive assumptions. We define domain counterfactually equivalent models and prove necessary and sufficient properties for equivalent models that provide a tight characterization of the domain counterfactual equivalence classes. Building upon this result, we prove that every equivalence class contains a model where all intervened variables are at the end when topologically sorted by the causal DAG. This surprising result suggests that a model design that only allows intervention in the last $k$ latent variables may improve model estimation for counterfactuals. We then test this model design on extensive simulated and image-based experiments which show the sparse canonical model indeed improves counterfactual estimation over baseline non-sparse models.
    Environment-independent mmWave Fall Detection with Interacting Multiple Model. (arXiv:2311.08755v1 [eess.SP])
    The ageing society brings attention to daily elderly care through sensing technologies. The future smart home is expected to enable in-home daily monitoring, such as fall detection, for seniors in a non-invasive, non-cooperative, and non-contact manner. The mmWave radar is a promising candidate technology for its privacy-preserving and non-contact manner. However, existing solutions suffer from low accuracy and robustness due to environment dependent features. In this paper, we present FADE (\underline{FA}ll \underline{DE}tection), a practical fall detection radar system with enhanced accuracy and robustness in real-world scenarios. The key enabler underlying FADE is an interacting multiple model (IMM) state estimator that can extract environment-independent features for highly accurate and instantaneous fall detection. Furthermore, we proposed a robust multiple-user tracking system to deal with noises from the environment and other human bodies. We deployed our algorithm on low computing power and low power consumption system-on-chip (SoC) composed of data front end, DSP, and ARM processor, and tested its performance in real-world. The experiment shows that the accuracy of fall detection is up to 95\%.
    Synthesizing Missing MRI Sequences from Available Modalities using Generative Adversarial Networks in BraTS Dataset. (arXiv:2310.07250v3 [q-bio.QM] UPDATED)
    Glioblastoma is a highly aggressive and lethal form of brain cancer. Magnetic resonance imaging (MRI) plays a significant role in the diagnosis, treatment planning, and follow-up of glioblastoma patients due to its non-invasive and radiation-free nature. The International Brain Tumor Segmentation (BraTS) challenge has contributed to generating numerous AI algorithms to accurately and efficiently segment glioblastoma sub-compartments using four structural (T1, T1Gd, T2, T2-FLAIR) MRI scans. However, these four MRI sequences may not always be available. To address this issue, Generative Adversarial Networks (GANs) can be used to synthesize the missing MRI sequences. In this paper, we implement and utilize an open-source GAN approach that takes any three MRI sequences as input to generate the missing fourth structural sequence. Our proposed approach is contributed to the community-driven generally nuanced deep learning framework (GaNDLF) and demonstrates promising results in synthesizing high-quality and realistic MRI sequences, enabling clinicians to improve their diagnostic capabilities and support the application of AI methods to brain tumor MRI quantification.
    Approaching adverse event detection utilizing transformers on clinical time-series. (arXiv:2311.09165v1 [cs.LG])
    Patients being admitted to a hospital will most often be associated with a certain clinical development during their stay. However, there is always a risk of patients being subject to the wrong diagnosis or to a certain treatment not pertaining to the desired effect, potentially leading to adverse events. Our research aims to develop an anomaly detection system for identifying deviations from expected clinical trajectories. To address this goal we analyzed 16 months of vital sign recordings obtained from the Nordland Hospital Trust (NHT). We employed an self-supervised framework based on the STraTS transformer architecture to represent the time series data in a latent space. These representations were then subjected to various clustering techniques to explore potential patient phenotypes based on their clinical progress. While our preliminary results from this ongoing research are promising, they underscore the importance of enhancing the dataset with additional demographic information from patients. This additional data will be crucial for a more comprehensive evaluation of the method's performance.
    Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models. (arXiv:2310.13913v2 [cs.LG] UPDATED)
    Protein-ligand structure prediction is an essential task in drug discovery, predicting the binding interactions between small molecules (ligands) and target proteins (receptors). Although conventional physics-based docking tools are widely utilized, their accuracy is compromised by limited conformational sampling and imprecise scoring functions. Recent advances have incorporated deep learning techniques to improve the accuracy of structure prediction. Nevertheless, the experimental validation of docking conformations remains costly, it raises concerns regarding the generalizability of these deep learning-based methods due to the limited training data. In this work, we show that by pre-training a geometry-aware SE(3)-Equivariant neural network on a large-scale docking conformation generated by traditional physics-based docking tools and then fine-tuning with a limited set of experimentally validated receptor-ligand complexes, we can achieve outstanding performance. This process involved the generation of 100 million docking conformations, consuming roughly 1 million CPU core days. The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated by the physics-based docking tools during the pre-training phase. HelixDock has been benchmarked against both physics-based and deep learning-based baselines, showing that it outperforms its closest competitor by over 40% for RMSD. HelixDock also exhibits enhanced performance on a dataset that poses a greater challenge, thereby highlighting its robustness. Moreover, our investigation reveals the scaling laws governing pre-trained structure prediction models, indicating a consistent enhancement in performance with increases in model parameters and pre-training data. This study illuminates the strategic advantage of leveraging a vast and varied repository of generated data to advance the frontiers of AI-driven drug discovery.
    Resetting the Optimizer in Deep RL: An Empirical Study. (arXiv:2306.17833v2 [cs.LG] UPDATED)
    We focus on the task of approximating the optimal value function in deep reinforcement learning. This iterative process is comprised of solving a sequence of optimization problems where the loss function changes per iteration. The common approach to solving this sequence of problems is to employ modern variants of the stochastic gradient descent algorithm such as Adam. These optimizers maintain their own internal parameters such as estimates of the first-order and the second-order moments of the gradient, and update them over time. Therefore, information obtained in previous iterations is used to solve the optimization problem in the current iteration. We demonstrate that this can contaminate the moment estimates because the optimization landscape can change arbitrarily from one iteration to the next one. To hedge against this negative effect, a simple idea is to reset the internal parameters of the optimizer when starting a new iteration. We empirically investigate this resetting idea by employing various optimizers in conjunction with the Rainbow algorithm. We demonstrate that this simple modification significantly improves the performance of deep RL on the Atari benchmark.
    DASVDD: Deep Autoencoding Support Vector Data Descriptor for Anomaly Detection. (arXiv:2106.05410v3 [cs.LG] UPDATED)
    Semi-supervised anomaly detection aims to detect anomalies from normal samples using a model that is trained on normal data. With recent advancements in deep learning, researchers have designed efficient deep anomaly detection methods. Existing works commonly use neural networks to map the data into a more informative representation and then apply an anomaly detection algorithm. In this paper, we propose a method, DASVDD, that jointly learns the parameters of an autoencoder while minimizing the volume of an enclosing hyper-sphere on its latent representation. We propose an anomaly score which is a combination of autoencoder's reconstruction error and the distance from the center of the enclosing hypersphere in the latent representation. Minimizing this anomaly score aids us in learning the underlying distribution of the normal class during training. Including the reconstruction error in the anomaly score ensures that DASVDD does not suffer from the common hypersphere collapse issue since the DASVDD model does not converge to the trivial solution of mapping all inputs to a constant point in the latent representation. Experimental evaluations on several benchmark datasets show that the proposed method outperforms the commonly used state-of-the-art anomaly detection algorithms while maintaining robust performance across different anomaly classes.
    Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations. (arXiv:2311.08815v1 [cs.LG])
    Self-supervised representation learning often uses data augmentations to induce some invariance to "style" attributes of the data. However, with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded. To address this, we introduce a more principled approach that seeks to disentangle style features rather than discard them. The key idea is to add multiple style embedding spaces where: (i) each is invariant to all-but-one augmentation; and (ii) joint entropy is maximized. We formalize our structured data-augmentation procedure from a causal latent-variable-model perspective, and prove identifiability of both content and (multiple blocks of) style variables. We empirically demonstrate the benefits of our approach on synthetic datasets and then present promising but limited results on ImageNet.
    From random-walks to graph-sprints: a low-latency node embedding framework on continuous-time dynamic graphs. (arXiv:2307.08433v3 [cs.LG] UPDATED)
    Many real-world datasets have an underlying dynamic graph structure, where entities and their interactions evolve over time. Machine learning models should consider these dynamics in order to harness their full potential in downstream tasks. Previous approaches for graph representation learning have focused on either sampling k-hop neighborhoods, akin to breadth-first search, or random walks, akin to depth-first search. However, these methods are computationally expensive and unsuitable for real-time, low-latency inference on dynamic graphs. To overcome these limitations, we propose graph-sprints a general purpose feature extraction framework for continuous-time-dynamic-graphs (CTDGs) that has low latency and is competitive with state-of-the-art, higher latency models. To achieve this, a streaming, low latency approximation to the random-walk based features is proposed. In our framework, time-aware node embeddings summarizing multi-hop information are computed using only single-hop operations on the incoming edges. We evaluate our proposed approach on three open-source datasets and two in-house datasets, and compare with three state-of-the-art algorithms (TGN-attn, TGN-ID, Jodie). We demonstrate that our graph-sprints features, combined with a machine learning classifier, achieve competitive performance (outperforming all baselines for the node classification tasks in five datasets). Simultaneously, graph-sprints significantly reduce inference latencies, achieving close to an order of magnitude speed-up in our experimental setting.
    On the Computation of the Gaussian Rate-Distortion-Perception Function. (arXiv:2311.09190v1 [cs.IT])
    In this paper, we study the computation of the rate-distortion-perception function (RDPF) for a multivariate Gaussian source under mean squared error (MSE) distortion and, respectively, Kullback-Leibler divergence, geometric Jensen-Shannon divergence, squared Hellinger distance, and squared Wasserstein-2 distance perception metrics. To this end, we first characterize the analytical bounds of the scalar Gaussian RDPF for the aforementioned divergence functions, also providing the RDPF-achieving forward "test-channel" realization. Focusing on the multivariate case, we establish that, for tensorizable distortion and perception metrics, the optimal solution resides on the vector space spanned by the eigenvector of the source covariance matrix. Consequently, the multivariate optimization problem can be expressed as a function of the scalar Gaussian RDPFs of the source marginals, constrained by global distortion and perception levels. Leveraging this characterization, we design an alternating minimization scheme based on the block nonlinear Gauss-Seidel method, which optimally solves the problem while identifying the Gaussian RDPF-achieving realization. Furthermore, the associated algorithmic embodiment is provided, as well as the convergence and the rate of convergence characterization. Lastly, for the "perfect realism" regime, the analytical solution for the multivariate Gaussian RDPF is obtained. We corroborate our results with numerical simulations and draw connections to existing results.
    Learning Fair Division from Bandit Feedback. (arXiv:2311.09068v1 [cs.LG])
    This work addresses learning online fair division under uncertainty, where a central planner sequentially allocates items without precise knowledge of agents' values or utilities. Departing from conventional online algorithm, the planner here relies on noisy, estimated values obtained after allocating items. We introduce wrapper algorithms utilizing \textit{dual averaging}, enabling gradual learning of both the type distribution of arriving items and agents' values through bandit feedback. This approach enables the algorithms to asymptotically achieve optimal Nash social welfare in linear Fisher markets with agents having additive utilities. We establish regret bounds in Nash social welfare and empirically validate the superior performance of our proposed algorithms across synthetic and empirical datasets.
    A Unified Approach to Learning Ising Models: Beyond Independence and Bounded Width. (arXiv:2311.09197v1 [cs.LG])
    We revisit the problem of efficiently learning the underlying parameters of Ising models from data. Current algorithmic approaches achieve essentially optimal sample complexity when given i.i.d. samples from the stationary measure and the underlying model satisfies "width" bounds on the total $\ell_1$ interaction involving each node. We show that a simple existing approach based on node-wise logistic regression provably succeeds at recovering the underlying model in several new settings where these assumptions are violated: (1) Given dynamically generated data from a wide variety of local Markov chains, like block or round-robin dynamics, logistic regression recovers the parameters with optimal sample complexity up to $\log\log n$ factors. This generalizes the specialized algorithm of Bresler, Gamarnik, and Shah [IEEE Trans. Inf. Theory'18] for structure recovery in bounded degree graphs from Glauber dynamics. (2) For the Sherrington-Kirkpatrick model of spin glasses, given $\mathsf{poly}(n)$ independent samples, logistic regression recovers the parameters in most of the known high-temperature regime via a simple reduction to weaker structural properties of the measure. This improves on recent work of Anari, Jain, Koehler, Pham, and Vuong [ArXiv'23] which gives distribution learning at higher temperature. (3) As a simple byproduct of our techniques, logistic regression achieves an exponential improvement in learning from samples in the M-regime of data considered by Dutt, Lokhov, Vuffray, and Misra [ICML'21] as well as novel guarantees for learning from the adversarial Glauber dynamics of Chin, Moitra, Mossel, and Sandon [ArXiv'23]. Our approach thus significantly generalizes the elegant analysis of Wu, Sanghavi, and Dimakis [Neurips'19] without any algorithmic modification.
    Supervised learning with probabilistic morphisms and kernel mean embeddings. (arXiv:2305.06348v5 [math.ST] UPDATED)
    In this paper I propose a generative model of supervised learning that unifies two approaches to supervised learning, using a concept of a correct loss function. Addressing two measurability problems, which have been ignored in statistical learning theory, I propose to use convergence in outer probability to characterize the consistency of a learning algorithm. Building upon these results, I extend a result due to Cucker-Smale, which addresses the learnability of a regression model, to the setting of a conditional probability estimation problem. Additionally, I present a variant of Vapnik-Stefanuyk's regularization method for solving stochastic ill-posed problems, and using it to prove the generalizability of overparameterized supervised learning models.
    X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects. (arXiv:2311.08788v1 [cs.CL])
    Natural Language Generation (NLG) typically involves evaluating the generated text in various aspects (e.g., consistency and naturalness) to obtain a comprehensive assessment. However, multi-aspect evaluation remains challenging as it may require the evaluator to generalize to any given evaluation aspect even if it's absent during training. In this paper, we introduce X-Eval, a two-stage instruction tuning framework to evaluate the text in both seen and unseen aspects customized by end users. X-Eval consists of two learning stages: the vanilla instruction tuning stage that improves the model's ability to follow evaluation instructions, and an enhanced instruction tuning stage that exploits the connections between fine-grained evaluation aspects to better assess text quality. To support the training of X-Eval, we collect AspectInstruct, the first instruction tuning dataset tailored for multi-aspect NLG evaluation spanning 27 diverse evaluation aspects with 65 tasks. To enhance task diversity, we devise an augmentation strategy that converts human rating annotations into diverse forms of NLG evaluation tasks, including scoring, comparison, ranking, and Boolean question answering. Extensive experiments across three essential categories of NLG tasks: dialogue generation, summarization, and data-to-text coupled with 21 aspects in meta-evaluation, demonstrate that our X-Eval enables even a lightweight language model to achieve a comparable if not higher correlation with human judgments compared to the state-of-the-art NLG evaluators, such as GPT-4.
    ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment. (arXiv:2305.14463v2 [cs.CL] UPDATED)
    We present a systematic study and comprehensive evaluation of large language models for automatic multilingual readability assessment. In particular, we construct ReadMe++, a multilingual multi-domain dataset with human annotations of 9757 sentences in Arabic, English, French, Hindi, and Russian collected from 112 different data sources. ReadMe++ offers more domain and language diversity than existing readability datasets, making it ideal for benchmarking multilingual and non-English language models (including mBERT, XLM-R, mT5, Llama-2, GPT-4, etc.) in the supervised, unsupervised, and few-shot prompting settings. Our experiments reveal that models fine-tuned on ReadMe++ outperform those trained on single-domain datasets, showcasing superior performance on multi-domain readability assessment and cross-lingual transfer capabilities. We also compare to traditional readability metrics (such as Flesch-Kincaid Grade Level and Open Source Metric for Measuring Arabic Narratives), as well as the state-of-the-art unsupervised metric RSRS (Martinc et al., 2021). We will make our data and code publicly available at: https://github.com/tareknaous/readme.
    $FastDoc$: Domain-Specific Fast Pre-training Technique using Document-Level Metadata and Taxonomy. (arXiv:2306.06190v2 [cs.CL] UPDATED)
    As the demand for sophisticated Natural Language Processing (NLP) models continues to grow, so does the need for efficient pre-training techniques. Current NLP models undergo resource-intensive pre-training. In response, we introduce $FastDoc$ (Fast Pre-training Technique using Document-Level Metadata and Taxonomy), a novel approach designed to significantly reduce computational demands. $FastDoc$ leverages document metadata and domain-specific taxonomy as supervision signals. It involves continual pre-training of an open-domain transformer encoder using sentence-level embeddings, followed by fine-tuning using token-level embeddings. We evaluate $FastDoc$ on six tasks across nine datasets spanning three distinct domains. Remarkably, $FastDoc$ achieves remarkable compute reductions of approximately 1,000x, 4,500x, 500x compared to competitive approaches in Customer Support, Scientific, and Legal domains, respectively. Importantly, these efficiency gains do not compromise performance relative to competitive baselines. Furthermore, reduced pre-training data mitigates catastrophic forgetting, ensuring consistent performance in open-domain scenarios. $FastDoc$ offers a promising solution for resource-efficient pre-training, with potential applications spanning various domains.
    Towards A Unified View of Answer Calibration for Multi-Step Reasoning. (arXiv:2311.09101v1 [cs.CL])
    Large Language Models (LLMs) employing Chain-of-Thought (CoT) prompting have broadened the scope for improving multi-step reasoning capabilities. Usually, answer calibration strategies such as step-level or path-level calibration play a vital role in multi-step reasoning. While effective, there remains a significant gap in our understanding of the key factors that drive their success. In this paper, we break down the design of recent answer calibration strategies and present a unified view which establishes connections between them. We then conduct a thorough evaluation on these strategies from a unified view, systematically scrutinizing step-level and path-level answer calibration across multiple paths. Our study holds the potential to illuminate key insights for optimizing multi-step reasoning with answer calibration.
    On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-$n$ Recommendation. (arXiv:2307.15053v2 [cs.IR] UPDATED)
    Approaches to recommendation are typically evaluated in one of two ways: (1) via a (simulated) online experiment, often seen as the gold standard, or (2) via some offline evaluation procedure, where the goal is to approximate the outcome of an online experiment. Several offline evaluation metrics have been adopted in the literature, inspired by ranking metrics prevalent in the field of Information Retrieval. (Normalised) Discounted Cumulative Gain (nDCG) is one such metric that has seen widespread adoption in empirical studies, and higher (n)DCG values have been used to present new methods as the state-of-the-art in top-$n$ recommendation for many years. Our work takes a critical look at this approach, and investigates when we can expect such metrics to approximate the gold standard outcome of an online experiment. We formally present the assumptions that are necessary to consider DCG an unbiased estimator of online reward and provide a derivation for this metric from first principles, highlighting where we deviate from its traditional uses in IR. Importantly, we show that normalising the metric renders it inconsistent, in that even when DCG is unbiased, ranking competing methods by their normalised DCG can invert their relative order. Through a correlation analysis between off- and on-line experiments conducted on a large-scale recommendation platform, we show that our unbiased DCG estimates strongly correlate with online reward, even when some of the metric's inherent assumptions are violated. This statement no longer holds for its normalised variant, suggesting that nDCG's practical utility may be limited.
    Statistical learning by sparse deep neural networks. (arXiv:2311.08845v1 [math.ST])
    We consider a deep neural network estimator based on empirical risk minimization with l_1-regularization. We derive a general bound for its excess risk in regression and classification (including multiclass), and prove that it is adaptively nearly-minimax (up to log-factors) simultaneously across the entire range of various function classes.
    Deep Learning for Two-Sided Matching. (arXiv:2107.03427v2 [cs.GT] UPDATED)
    We initiate the study of deep learning for the automated design of two-sided matching mechanisms. What is of most interest is to use machine learning to understand the possibility of new tradeoffs between strategy-proofness and stability. These properties cannot be achieved simultaneously, but the efficient frontier is not understood. We introduce novel differentiable surrogates for quantifying ordinal strategy-proofness and stability and use them to train differentiable matching mechanisms that map discrete preferences to valid randomized matchings. We demonstrate that the efficient frontier characterized by these learned mechanisms is substantially better than that achievable through a convex combination of baselines of deferred acceptance (stable and strategy-proof for only one side of the market), top trading cycles (strategy-proof for one side, but not stable), and randomized serial dictatorship (strategy-proof for both sides, but not stable). This gives a new target for economic theory and opens up new possibilities for machine learning pipelines in matching market design.
    OVeNet: Offset Vector Network for Semantic Segmentation. (arXiv:2303.14516v2 [cs.CV] UPDATED)
    Semantic segmentation is a fundamental task in visual scene understanding. We focus on the supervised setting, where ground-truth semantic annotations are available. Based on knowledge about the high regularity of real-world scenes, we propose a method for improving class predictions by learning to selectively exploit information from neighboring pixels. In particular, our method is based on the prior that for each pixel, there is a seed pixel in its close neighborhood sharing the same prediction with the former. Motivated by this prior, we design a novel two-head network, named Offset Vector Network (OVeNet), which generates both standard semantic predictions and a dense 2D offset vector field indicating the offset from each pixel to the respective seed pixel, which is used to compute an alternative, seed-based semantic prediction. The two predictions are adaptively fused at each pixel using a learnt dense confidence map for the predicted offset vector field. We supervise offset vectors indirectly via optimizing the seed-based prediction and via a novel loss on the confidence map. Compared to the baseline state-of-the-art architectures HRNet and HRNet+OCR on which OVeNet is built, the latter achieves significant performance gains on three prominent benchmarks for semantic segmentation, namely Cityscapes, ACDC and ADE20K. Code is available at https://github.com/stamatisalex/OVeNet
    Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task. (arXiv:2310.09336v2 [cs.LG] UPDATED)
    Modern generative models exhibit unprecedented capabilities to generate extremely realistic data. However, given the inherent compositionality of the real world, reliable use of these models in practical applications requires that they exhibit the capability to compose a novel set of concepts to generate outputs not seen in the training data set. Prior work demonstrates that recent diffusion models do exhibit intriguing compositional generalization abilities, but also fail unpredictably. Motivated by this, we perform a controlled study for understanding compositional generalization in conditional diffusion models in a synthetic setting, varying different attributes of the training data and measuring the model's ability to generate samples out-of-distribution. Our results show: (i) the order in which the ability to generate samples from a concept and compose them emerges is governed by the structure of the underlying data-generating process; (ii) performance on compositional tasks exhibits a sudden "emergence" due to multiplicative reliance on the performance of constituent tasks, partially explaining emergent phenomena seen in generative models; and (iii) composing concepts with lower frequency in the training data to generate out-of-distribution samples requires considerably more optimization steps compared to generating in-distribution samples. Overall, our study lays a foundation for understanding capabilities and compositionality in generative models from a data-centric perspective.
    Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts. (arXiv:2311.09127v1 [cs.CR])
    Existing work on jailbreak Multimodal Large Language Models (MLLMs) has focused primarily on adversarial examples in model inputs, with less attention to vulnerabilities in model APIs. To fill the research gap, we carry out the following work: 1) We discover a system prompt leakage vulnerability in GPT-4V. Through carefully designed dialogue, we successfully steal the internal system prompts of GPT-4V. This finding indicates potential exploitable security risks in MLLMs; 2)Based on the acquired system prompts, we propose a novel MLLM jailbreaking attack method termed SASP (Self-Adversarial Attack via System Prompt). By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. Furthermore, in pursuit of better performance, we also add human modification based on GPT-4's analysis, which further improves the attack success rate to 98.7\%; 3) We evaluated the effect of modifying system prompts to defend against jailbreaking attacks. Results show that appropriately designed system prompts can significantly reduce jailbreak success rates. Overall, our work provides new insights into enhancing MLLM security, demonstrating the important role of system prompts in jailbreaking, which could be leveraged to greatly facilitate jailbreak success rates while also holding the potential for defending against jailbreaks.
    Data Augmentations in Deep Weight Spaces. (arXiv:2311.08851v1 [cs.LG])
    Learning in weight spaces, where neural networks process the weights of other deep neural networks, has emerged as a promising research direction with applications in various fields, from analyzing and editing neural fields and implicit neural representations, to network pruning and quantization. Recent works designed architectures for effective learning in that space, which takes into account its unique, permutation-equivariant, structure. Unfortunately, so far these architectures suffer from severe overfitting and were shown to benefit from large datasets. This poses a significant challenge because generating data for this learning setup is laborious and time-consuming since each data sample is a full set of network weights that has to be trained. In this paper, we address this difficulty by investigating data augmentations for weight spaces, a set of techniques that enable generating new data examples on the fly without having to train additional input weight space elements. We first review several recently proposed data augmentation schemes %that were proposed recently and divide them into categories. We then introduce a novel augmentation scheme based on the Mixup method. We evaluate the performance of these techniques on existing benchmarks as well as new benchmarks we generate, which can be valuable for future studies.
    Semidefinite programs simulate approximate message passing robustly. (arXiv:2311.09017v1 [cs.DS])
    Approximate message passing (AMP) is a family of iterative algorithms that generalize matrix power iteration. AMP algorithms are known to optimally solve many average-case optimization problems. In this paper, we show that a large class of AMP algorithms can be simulated in polynomial time by \emph{local statistics hierarchy} semidefinite programs (SDPs), even when an unknown principal minor of measure $1/\mathrm{polylog}(\mathrm{dimension})$ is adversarially corrupted. Ours are the first robust guarantees for many of these problems. Further, our results offer an interesting counterpoint to strong lower bounds against less constrained SDP relaxations for average-case max-cut-gain (a.k.a. "optimizing the Sherrington-Kirkpatrick Hamiltonian") and other problems.
    SE-shapelets: Semi-supervised Clustering of Time Series Using Representative Shapelets. (arXiv:2304.03292v2 [cs.LG] UPDATED)
    Shapelets that discriminate time series using local features (subsequences) are promising for time series clustering. Existing time series clustering methods may fail to capture representative shapelets because they discover shapelets from a large pool of uninformative subsequences, and thus result in low clustering accuracy. This paper proposes a Semi-supervised Clustering of Time Series Using Representative Shapelets (SE-Shapelets) method, which utilizes a small number of labeled and propagated pseudo-labeled time series to help discover representative shapelets, thereby improving the clustering accuracy. In SE-Shapelets, we propose two techniques to discover representative shapelets for the effective clustering of time series. 1) A \textit{salient subsequence chain} ($SSC$) that can extract salient subsequences (as candidate shapelets) of a labeled/pseudo-labeled time series, which helps remove massive uninformative subsequences from the pool. 2) A \textit{linear discriminant selection} ($LDS$) algorithm to identify shapelets that can capture representative local features of time series in different classes, for convenient clustering. Experiments on UCR time series datasets demonstrate that SE-shapelets discovers representative shapelets and achieves higher clustering accuracy than counterpart semi-supervised time series clustering methods.
    k-Parameter Approach for False In-Season Anomaly Suppression in Daily Time Series Anomaly Detection. (arXiv:2311.08422v1 [cs.LG])
    Detecting anomalies in a daily time series with a weekly pattern is a common task with a wide range of applications. A typical way of performing the task is by using decomposition method. However, the method often generates false positive results where a data point falls within its weekly range but is just off from its weekday position. We refer to this type of anomalies as "in-season anomalies", and propose a k-parameter approach to address the issue. The approach provides configurable extra tolerance for in-season anomalies to suppress misleading alerts while preserving real positives. It yields favorable result.
    Beyond Vectors: Subspace Representations for Set Operations of Embeddings. (arXiv:2210.13034v2 [cs.CL] UPDATED)
    In natural language processing (NLP), the role of embeddings in representing linguistic semantics is crucial. Despite the prevalence of vector representations in embedding sets, they exhibit limitations in expressiveness and lack comprehensive set operations. To address this, we attempt to formulate and apply sets and their operations within pre-trained embedding spaces. Inspired by quantum logic, we propose to go beyond the conventional vector set representation with our novel subspace-based approach. This methodology constructs subspaces using pre-trained embedding sets, effectively preserving semantic nuances previously overlooked, and consequently consistently improving performance in downstream tasks.
    Unsupervised approaches based on optimal transport and convex analysis for inverse problems in imaging. (arXiv:2311.08972v1 [cs.CV])
    Unsupervised deep learning approaches have recently become one of the crucial research areas in imaging owing to their ability to learn expressive and powerful reconstruction operators even when paired high-quality training data is scarcely available. In this chapter, we review theoretically principled unsupervised learning schemes for solving imaging inverse problems, with a particular focus on methods rooted in optimal transport and convex analysis. We begin by reviewing the optimal transport-based unsupervised approaches such as the cycle-consistency-based models and learned adversarial regularization methods, which have clear probabilistic interpretations. Subsequently, we give an overview of a recent line of works on provably convergent learned optimization algorithms applied to accelerate the solution of imaging inverse problems, alongside their dedicated unsupervised training schemes. We also survey a number of provably convergent plug-and-play algorithms (based on gradient-step deep denoisers), which are among the most important and widely applied unsupervised approaches for imaging problems. At the end of this survey, we provide an overview of a few related unsupervised learning frameworks that complement our focused schemes. Together with a detailed survey, we provide an overview of the key mathematical results that underlie the methods reviewed in the chapter to keep our discussion self-contained.
    Model Agnostic Explainable Selective Regression via Uncertainty Estimation. (arXiv:2311.09145v1 [cs.LG])
    With the wide adoption of machine learning techniques, requirements have evolved beyond sheer high performance, often requiring models to be trustworthy. A common approach to increase the trustworthiness of such systems is to allow them to refrain from predicting. Such a framework is known as selective prediction. While selective prediction for classification tasks has been widely analyzed, the problem of selective regression is understudied. This paper presents a novel approach to selective regression that utilizes model-agnostic non-parametric uncertainty estimation. Our proposed framework showcases superior performance compared to state-of-the-art selective regressors, as demonstrated through comprehensive benchmarking on 69 datasets. Finally, we use explainable AI techniques to gain an understanding of the drivers behind selective regression. We implement our selective regression method in the open-source Python package doubt and release the code used to reproduce our experiments.
    Causal Discovery Under Local Privacy. (arXiv:2311.04037v2 [cs.CR] UPDATED)
    Differential privacy is a widely adopted framework designed to safeguard the sensitive information of data providers within a data set. It is based on the application of controlled noise at the interface between the server that stores and processes the data, and the data consumers. Local differential privacy is a variant that allows data providers to apply the privatization mechanism themselves on their data individually. Therefore it provides protection also in contexts in which the server, or even the data collector, cannot be trusted. The introduction of noise, however, inevitably affects the utility of the data, particularly by distorting the correlations between individual data components. This distortion can prove detrimental to tasks such as causal discovery. In this paper, we consider various well-known locally differentially private mechanisms and compare the trade-off between the privacy they provide, and the accuracy of the causal structure produced by algorithms for causal learning when applied to data obfuscated by these mechanisms. Our analysis yields valuable insights for selecting appropriate local differentially private protocols for causal discovery tasks. We foresee that our findings will aid researchers and practitioners in conducting locally private causal discovery.
    URANUS: Radio Frequency Tracking, Classification and Identification of Unmanned Aircraft Vehicles. (arXiv:2207.06025v3 [cs.LG] UPDATED)
    Safety and security issues for Critical Infrastructures are growing as attackers adopt drones as an attack vector flying in sensitive airspaces, such as airports, military bases, city centers, and crowded places. Despite the use of UAVs for logistics, shipping recreation activities, and commercial applications, their usage poses severe concerns to operators due to the violations and the invasions of the restricted airspaces. A cost-effective and real-time framework is needed to detect the presence of drones in such cases. In this contribution, we propose an efficient radio frequency-based detection framework called URANUS. We leverage real-time data provided by the Radio Frequency/Direction Finding system, and radars in order to detect, classify and identify drones (multi-copter and fixed-wings) invading no-drone zones. We adopt a Multilayer Perceptron neural network to identify and classify UAVs in real-time, with $90$% accuracy. For the tracking task, we use a Random Forest model to predict the position of a drone with an MSE $\approx0.29$, MAE $\approx0.04$, and $R^2\approx 0.93$. Furthermore, coordinate regression is performed using Universal Transverse Mercator coordinates to ensure high accuracy. Our analysis shows that URANUS is an ideal framework for identifying, classifying, and tracking UAVs that most Critical Infrastructure operators can adopt.
    The Bias Amplification Paradox in Text-to-Image Generation. (arXiv:2308.00755v2 [cs.LG] UPDATED)
    Bias amplification is a phenomenon in which models exacerbate biases or stereotypes present in the training data. In this paper, we study bias amplification in the text-to-image domain using Stable Diffusion by comparing gender ratios in training vs. generated images. We find that the model appears to amplify gender-occupation biases found in the training data (LAION) considerably. However, we discover that amplification can be largely attributed to discrepancies between training captions and model prompts. For example, an inherent difference is that captions from the training data often contain explicit gender information while our prompts do not, which leads to a distribution shift and consequently inflates bias measures. Once we account for distributional differences between texts used for training and generation when evaluating amplification, we observe that amplification decreases drastically. Our findings illustrate the challenges of comparing biases in models and their training data, and highlight confounding factors that impact analyses.
    Does Pre-trained Language Model Actually Infer Unseen Links in Knowledge Graph Completion?. (arXiv:2311.09109v1 [cs.CL])
    Knowledge graphs (KGs) consist of links that describe relationships between entities. Due to the difficulty of manually enumerating all relationships between entities, automatically completing them is essential for KGs. Knowledge Graph Completion (KGC) is a task that infers unseen relationships between entities in a KG. Traditional embedding-based KGC methods, such as RESCAL, TransE, DistMult, ComplEx, RotatE, HAKE, HousE, etc., infer missing links using only the knowledge from training data. In contrast, the recent Pre-trained Language Model (PLM)-based KGC utilizes knowledge obtained during pre-training. Therefore, PLM-based KGC can estimate missing links between entities by reusing memorized knowledge from pre-training without inference. This approach is problematic because building KGC models aims to infer unseen links between entities. However, conventional evaluations in KGC do not consider inference and memorization abilities separately. Thus, a PLM-based KGC method, which achieves high performance in current KGC evaluations, may be ineffective in practical applications. To address this issue, we analyze whether PLM-based KGC methods make inferences or merely access memorized knowledge. For this purpose, we propose a method for constructing synthetic datasets specified in this analysis and conclude that PLMs acquire the inference abilities required for KGC through pre-training, even though the performance improvements mostly come from textual information of entities and relations.
    AdCraft: An Advanced Reinforcement Learning Benchmark Environment for Search Engine Marketing Optimization. (arXiv:2306.11971v3 [cs.LG] UPDATED)
    We introduce AdCraft, a novel benchmark environment for the Reinforcement Learning (RL) community distinguished by its stochastic and non-stationary properties. The environment simulates bidding and budgeting dynamics within Search Engine Marketing (SEM), a digital marketing technique utilizing paid advertising to enhance the visibility of websites on search engine results pages (SERPs). The performance of SEM advertisement campaigns depends on several factors, including keyword selection, ad design, bid management, budget adjustments, and performance monitoring. Deep RL recently emerged as a potential strategy to optimize campaign profitability within the complex and dynamic landscape of SEM, but it requires substantial data, which may be costly or infeasible to acquire in practice. Our customizable environment enables practitioners to assess and enhance the robustness of RL algorithms pertinent to SEM bid and budget management without such costs. Through a series of experiments within the environment, we demonstrate the challenges imposed on agent convergence and performance by sparsity and non-stationarity. We hope these challenges further encourage discourse and development around effective strategies for managing real-world uncertainties.
    Gradient Masked Averaging for Federated Learning. (arXiv:2201.11986v2 [cs.LG] UPDATED)
    Federated learning (FL) is an emerging paradigm that permits a large number of clients with heterogeneous data to coordinate learning of a unified global model without the need to share data amongst each other. A major challenge in federated learning is the heterogeneity of data across client, which can degrade the performance of standard FL algorithms. Standard FL algorithms involve averaging of model parameters or gradient updates to approximate the global model at the server. However, we argue that in heterogeneous settings, averaging can result in information loss and lead to poor generalization due to the bias induced by dominant client gradients. We hypothesize that to generalize better across non-i.i.d datasets, the algorithms should focus on learning the invariant mechanism that is constant while ignoring spurious mechanisms that differ across clients. Inspired from recent works in Out-of-Distribution generalization, we propose a gradient masked averaging approach for FL as an alternative to the standard averaging of client updates. This aggregation technique for client updates can be adapted as a drop-in replacement in most existing federated algorithms. We perform extensive experiments on multiple FL algorithms with in-distribution, real-world, feature-skewed out-of-distribution, and quantity imbalanced datasets and show that it provides consistent improvements, particularly in the case of heterogeneous clients.
    Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization. (arXiv:2311.09184v1 [cs.CL])
    While large language models (LLMs) already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement for the desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct human evaluation on 5 LLM-based summarization systems. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods in total. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) all LLM-based evaluation methods cannot achieve a strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation. We make our collected benchmark, InstruSum, publicly available to facilitate future research in this direction.
    Tunable Quantum Neural Networks in the QPAC-Learning Framework. (arXiv:2205.01514v4 [quant-ph] CROSS LISTED)
    In this paper, we investigate the performances of tunable quantum neural networks in the Quantum Probably Approximately Correct (QPAC) learning framework. Tunable neural networks are quantum circuits made of multi-controlled X gates. By tuning the set of controls these circuits are able to approximate any Boolean functions. This architecture is particularly suited to be used in the QPAC-learning framework as it can handle the superposition produced by the oracle. In order to tune the network so that it can approximate a target concept, we have devised and implemented an algorithm based on amplitude amplification. The numerical results show that this approach can efficiently learn concepts from a simple class.
    GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer. (arXiv:2311.08526v1 [cs.CL])
    Named Entity Recognition (NER) is essential in various Natural Language Processing (NLP) applications. Traditional NER models are effective but limited to a set of predefined entity types. In contrast, Large Language Models (LLMs) can extract arbitrary entities through natural language instructions, offering greater flexibility. However, their size and cost, particularly for those accessed via APIs like ChatGPT, make them impractical in resource-limited scenarios. In this paper, we introduce a compact NER model trained to identify any type of entity. Leveraging a bidirectional transformer encoder, our model, GLiNER, facilitates parallel entity extraction, an advantage over the slow sequential token generation of LLMs. Through comprehensive testing, GLiNER demonstrate strong performance, outperforming both ChatGPT and fine-tuned LLMs in zero-shot evaluations on various NER benchmarks.  ( 2 min )
    Method for Text Entity Linking in Power Distribution Scheduling Oriented to Power Distribution Network Knowledge Graph. (arXiv:2311.08724v1 [cs.CL])
    The proposed method for linking entities in power distribution dispatch texts to a power distribution network knowledge graph is based on a deep understanding of these networks. This method leverages the unique features of entities in both the power distribution network's knowledge graph and the dispatch texts, focusing on their semantic, phonetic, and syntactic characteristics. An enhanced model, the Lexical Semantic Feature-based Skip Convolutional Neural Network (LSF-SCNN), is utilized for effectively matching dispatch text entities with those in the knowledge graph. The efficacy of this model, compared to a control model, is evaluated through cross-validation methods in real-world power distribution dispatch scenarios. The results indicate that the LSF-SCNN model excels in accurately linking a variety of entity types, demonstrating high overall accuracy in entity linking when the process is conducted in English.
    Leveraging Transformers to Improve Breast Cancer Classification and Risk Assessment with Multi-modal and Longitudinal Data. (arXiv:2311.03217v2 [eess.IV] UPDATED)
    Breast cancer screening, primarily conducted through mammography, is often supplemented with ultrasound for women with dense breast tissue. However, existing deep learning models analyze each modality independently, missing opportunities to integrate information across imaging modalities and time. In this study, we present Multi-modal Transformer (MMT), a neural network that utilizes mammography and ultrasound synergistically, to identify patients who currently have cancer and estimate the risk of future cancer for patients who are currently cancer-free. MMT aggregates multi-modal data through self-attention and tracks temporal tissue changes by comparing current exams to prior imaging. Trained on 1.3 million exams, MMT achieves an AUROC of 0.943 in detecting existing cancers, surpassing strong uni-modal baselines. For 5-year risk prediction, MMT attains an AUROC of 0.826, outperforming prior mammography-based risk models. Our research highlights the value of multi-modal and longitudinal imaging in cancer diagnosis and risk stratification.
    AdaptNet: Policy Adaptation for Physics-Based Character Control. (arXiv:2310.00239v3 [cs.GR] UPDATED)
    Motivated by humans' ability to adapt skills in the learning of new ones, this paper presents AdaptNet, an approach for modifying the latent space of existing policies to allow new behaviors to be quickly learned from like tasks in comparison to learning from scratch. Building on top of a given reinforcement learning controller, AdaptNet uses a two-tier hierarchy that augments the original state embedding to support modest changes in a behavior and further modifies the policy network layers to make more substantive changes. The technique is shown to be effective for adapting existing physics-based controllers to a wide range of new styles for locomotion, new task targets, changes in character morphology and extensive changes in environment. Furthermore, it exhibits significant increase in learning efficiency, as indicated by greatly reduced training times when compared to training from scratch or using other approaches that modify existing policies. Code is available at https://motion-lab.github.io/AdaptNet.
    ThoraX-PriorNet: A Novel Attention-Based Architecture Using Anatomical Prior Probability Maps for Thoracic Disease Classification. (arXiv:2210.02998v2 [eess.IV] UPDATED)
    Objective: Computer-aided disease diagnosis and prognosis based on medical images is a rapidly emerging field. Many Convolutional Neural Network (CNN) architectures have been developed by researchers for disease classification and localization from chest X-ray images. It is known that different thoracic disease lesions are more likely to occur in specific anatomical regions compared to others. This article aims to incorporate this disease and region-dependent prior probability distribution within a deep learning framework. Methods: We present the ThoraX-PriorNet, a novel attention-based CNN model for thoracic disease classification. We first estimate a disease-dependent spatial probability, i.e., an anatomical prior, that indicates the probability of occurrence of a disease in a specific region in a chest X-ray image. Next, we develop a novel attention-based classification model that combines information from the estimated anatomical prior and automatically extracted chest region of interest (ROI) masks to provide attention to the feature maps generated from a deep convolution network. Unlike previous works that utilize various self-attention mechanisms, the proposed method leverages the extracted chest ROI masks along with the probabilistic anatomical prior information, which selects the region of interest for different diseases to provide attention. Results: The proposed method shows superior performance in disease classification on the NIH ChestX-ray14 dataset compared to existing state-of-the-art methods while reaching an area under the ROC curve (%AUC) of 84.67. Regarding disease localization, the anatomy prior attention method shows competitive performance compared to state-of-the-art methods, achieving an accuracy of 0.80, 0.63, 0.49, 0.33, 0.28, 0.21, and 0.04 with an Intersection over Union (IoU) threshold of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, and 0.7, respectively.
    Towards Label Embedding -- Measuring classification difficulty. (arXiv:2311.08874v1 [cs.LG])
    Uncertainty quantification in machine learning is a timely and vast field of research. In supervised learning, uncertainty can already occur in the very first stage of the training process, the labelling step. In particular, this is the case when not every instance can be unambiguously classified. The problem occurs for classifying instances, where classes may overlap or instances can not be clearly categorised. In other words, there is inevitable ambiguity in the annotation step and not necessarily a 'ground truth'. We look exemplary at the classification of satellite images. Each image is annotated independently by multiple labellers and classified into local climate zones (LCZs). For each instance we have multiple votes, leading to a distribution of labels rather than a single value. The main idea of this work is that we do not assume a ground truth label but embed the votes into a K-dimensional space, with K as the number of possible categories. The embedding is derived from the voting distribution in a Bayesian setup, modelled via a Dirichlet-Multinomial model. We estimate the model and posteriors using a stochastic Expectation Maximisation algorithm with Markov Chain Monte Carlo steps. While we focus on the particular example of LCZ classification, the methods developed in this paper readily extend to other situations where multiple annotators independently label texts or images. We also apply our approach to two other benchmark datasets for image classification to demonstrate this. Besides the embeddings themselves, we can investigate the resulting correlation matrices, which can be seen as generalised confusion matrices and reflect the semantic similarities of the original classes very well for all three exemplary datasets. The insights gained are valuable and can serve as general label embedding if a single ground truth per observation cannot be guaranteed.
    Semi-Supervised Learning in the Few-Shot Zero-Shot Scenario. (arXiv:2308.14119v2 [cs.CV] UPDATED)
    Semi-Supervised Learning (SSL) is a framework that utilizes both labeled and unlabeled data to enhance model performance. Conventional SSL methods operate under the assumption that labeled and unlabeled data share the same label space. However, in practical real-world scenarios, especially when the labeled training dataset is limited in size, some classes may be totally absent from the labeled set. To address this broader context, we propose a general approach to augment existing SSL methods, enabling them to effectively handle situations where certain classes are missing. This is achieved by introducing an additional term into their objective function, which penalizes the KL-divergence between the probability vectors of the true class frequencies and the inferred class frequencies. Our experimental results reveal significant improvements in accuracy when compared to state-of-the-art SSL, open-set SSL, and open-world SSL methods. We conducted these experiments on two benchmark image classification datasets, CIFAR-100 and STL-10, with the most remarkable improvements observed when the labeled data is severely limited, with only a few labeled examples per class
    Fast Detection of Phase Transitions with Multi-Task Learning-by-Confusion. (arXiv:2311.09128v1 [cs.LG])
    Machine learning has been successfully used to study phase transitions. One of the most popular approaches to identifying critical points from data without prior knowledge of the underlying phases is the learning-by-confusion scheme. As input, it requires system samples drawn from a grid of the parameter whose change is associated with potential phase transitions. Up to now, the scheme required training a distinct binary classifier for each possible splitting of the grid into two sides, resulting in a computational cost that scales linearly with the number of grid points. In this work, we propose and showcase an alternative implementation that only requires the training of a single multi-class classifier. Ideally, such multi-task learning eliminates the scaling with respect to the number of grid points. In applications to the Ising model and an image dataset generated with Stable Diffusion, we find significant speedups that closely correspond to the ideal case, with only minor deviations.
    Self-Supervised Curriculum Generation for Autonomous Reinforcement Learning without Task-Specific Knowledge. (arXiv:2311.09195v1 [cs.LG])
    A significant bottleneck in applying current reinforcement learning algorithms to real-world scenarios is the need to reset the environment between every episode. This reset process demands substantial human intervention, making it difficult for the agent to learn continuously and autonomously. Several recent works have introduced autonomous reinforcement learning (ARL) algorithms that generate curricula for jointly training reset and forward policies. While their curricula can reduce the number of required manual resets by taking into account the agent's learning progress, they rely on task-specific knowledge, such as predefined initial states or reset reward functions. In this paper, we propose a novel ARL algorithm that can generate a curriculum adaptive to the agent's learning progress without task-specific knowledge. Our curriculum empowers the agent to autonomously reset to diverse and informative initial states. To achieve this, we introduce a success discriminator that estimates the success probability from each initial state when the agent follows the forward policy. The success discriminator is trained with relabeled transitions in a self-supervised manner. Our experimental results demonstrate that our ARL algorithm can generate an adaptive curriculum and enable the agent to efficiently bootstrap to solve sparse-reward maze navigation tasks, outperforming baselines with significantly fewer manual resets.
    Scalable Federated Learning for Clients with Different Input Image Sizes and Numbers of Output Categories. (arXiv:2311.08716v1 [cs.LG])
    Federated learning is a privacy-preserving training method which consists of training from a plurality of clients but without sharing their confidential data. However, previous work on federated learning do not explore suitable neural network architectures for clients with different input images sizes and different numbers of output categories. In this paper, we propose an effective federated learning method named ScalableFL, where the depths and widths of the local models for each client are adjusted according to the clients' input image size and the numbers of output categories. In addition, we provide a new bound for the generalization gap of federated learning. In particular, this bound helps to explain the effectiveness of our scalable neural network approach. We demonstrate the effectiveness of ScalableFL in several heterogeneous client settings for both image classification and object detection tasks.
    Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization with Optimal Noise Scheduling. (arXiv:2311.08745v1 [cs.LG])
    The graduated optimization approach is a heuristic method for finding globally optimal solutions for nonconvex functions and has been theoretically analyzed in several studies. This paper defines a new family of nonconvex functions for graduated optimization, discusses their sufficient conditions, and provides a convergence analysis of the graduated optimization algorithm for them. It shows that stochastic gradient descent (SGD) with mini-batch stochastic gradients has the effect of smoothing the function, the degree of which is determined by the learning rate and batch size. This finding provides theoretical insights from a graduated optimization perspective on why large batch sizes fall into sharp local minima, why decaying learning rates and increasing batch sizes are superior to fixed learning rates and batch sizes, and what the optimal learning rate scheduling is. To the best of our knowledge, this is the first paper to provide a theoretical explanation for these aspects. Moreover, a new graduated optimization framework that uses a decaying learning rate and increasing batch size is analyzed and experimental results of image classification that support our theoretical findings are reported.
    Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit. (arXiv:2308.10238v3 [cs.LG] UPDATED)
    We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic arms, and the reward of each arm $s\in\{1, \ldots, d\}$ follows an unknown distribution with mean $\mu_s$. In each time step, a player pulls a single arm and observes its reward. The player's goal is to identify the optimal \emph{action} $\boldsymbol{\pi}^{*} = \argmax_{\boldsymbol{\pi} \in \mathcal{A}} \boldsymbol{\mu}^{\top}\boldsymbol{\pi}$ from a finite-sized real-valued \emph{action set} $\mathcal{A}\subset \mathbb{R}^{d}$ with as few arm pulls as possible. Previous methods in the R-CPE-MAB assume that the size of the action set $\mathcal{A}$ is polynomial in $d$. We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large in $d$. We also introduce a novel problem-dependent sample complexity lower bound of the R-CPE-MAB problem, and show that the GenTS-Explore algorithm achieves the optimal sample complexity up to a problem-dependent constant factor.
    Reward is not Necessary: How to Create a Modular & Compositional Self-Preserving Agent for Life-Long Learning. (arXiv:2211.10851v4 [cs.AI] UPDATED)
    Reinforcement Learning views the maximization of rewards and avoidance of punishments as central to explaining goal-directed behavior. However, over a life, organisms will need to learn about many different aspects of the world's structure: the states of the world and state-vector transition dynamics. The number of combinations of states grows exponentially as an agent incorporates new knowledge, and there is no obvious weighted combination of pre-existing rewards or costs defined for a given combination of states, as such a weighting would need to encode information about good and bad combinations prior to an agent's experience in the world. Therefore, we must develop more naturalistic accounts of behavior and motivation in large state-spaces. We show that it is possible to use only the intrinsic motivation metric of empowerment, which measures the agent's capacity to realize many possible futures under a transition operator. We propose to scale empowerment to hierarchical state-spaces by using Operator Bellman Equations. These equations produce state-time feasibility functions, which are compositional hierarchical state-time transition operators that map an initial state and time when an agent begins a policy to the final states and times of completing a goal. Because these functions are hierarchical operators we can define hierarchical empowerment measures on them. An agent can then optimize plans to distant states and times to maximize its hierarchical empowerment-gain, allowing it to discover goals that bring about a more favorable coupling of its internal structure (physiological states) to its external environment (world structure & spatial state). Life-long agents could therefore be primarily animated by principles of compositionality and empowerment, exhibiting self-concern for the growth & maintenance of their own structural integrity without recourse to reward-maximization.
    DLAS: An Exploration and Assessment of the Deep Learning Acceleration Stack. (arXiv:2311.08909v1 [cs.LG])
    Deep Neural Networks (DNNs) are extremely computationally demanding, which presents a large barrier to their deployment on resource-constrained devices. Since such devices are where many emerging deep learning applications lie (e.g., drones, vision-based medical technology), significant bodies of work from both the machine learning and systems communities have attempted to provide optimizations to accelerate DNNs. To help unify these two perspectives, in this paper we combine machine learning and systems techniques within the Deep Learning Acceleration Stack (DLAS), and demonstrate how these layers can be tightly dependent on each other with an across-stack perturbation study. We evaluate the impact on accuracy and inference time when varying different parameters of DLAS across two datasets, seven popular DNN architectures, four DNN compression techniques, three algorithmic primitives with sparse and dense variants, untuned and auto-scheduled code generation, and four hardware platforms. Our evaluation highlights how perturbations across DLAS parameters can cause significant variation and across-stack interactions. The highest level observation from our evaluation is that the model size, accuracy, and inference time are not guaranteed to be correlated. Overall we make 13 key observations, including that speedups provided by compression techniques are very hardware dependent, and that compiler auto-tuning can significantly alter what the best algorithm to use for a given configuration is. With DLAS, we aim to provide a reference framework to aid machine learning and systems practitioners in reasoning about the context in which their respective DNN acceleration solutions exist in. With our evaluation strongly motivating the need for co-design, we believe that DLAS can be a valuable concept for exploring the next generation of co-designed accelerated deep learning solutions.
    Imagine the Unseen World: A Benchmark for Systematic Generalization in Visual World Models. (arXiv:2311.09064v1 [cs.CV])
    Systematic compositionality, or the ability to adapt to novel situations by creating a mental model of the world using reusable pieces of knowledge, remains a significant challenge in machine learning. While there has been considerable progress in the language domain, efforts towards systematic visual imagination, or envisioning the dynamical implications of a visual observation, are in their infancy. We introduce the Systematic Visual Imagination Benchmark (SVIB), the first benchmark designed to address this problem head-on. SVIB offers a novel framework for a minimal world modeling problem, where models are evaluated based on their ability to generate one-step image-to-image transformations under a latent world dynamics. The framework provides benefits such as the possibility to jointly optimize for systematic perception and imagination, a range of difficulty levels, and the ability to control the fraction of possible factor combinations used during training. We provide a comprehensive evaluation of various baseline models on SVIB, offering insight into the current state-of-the-art in systematic visual imagination. We hope that this benchmark will help advance visual systematic compositionality.
    Low-light Pedestrian Detection in Visible and Infrared Image Feeds: Issues and Challenges. (arXiv:2311.08557v1 [cs.CV])
    Pedestrian detection has become a cornerstone for several high-level tasks, including autonomous driving, intelligent transportation, and traffic surveillance. There are several works focussed on pedestrian detection using visible images, mainly in the daytime. However, this task is very intriguing when the environmental conditions change to poor lighting or nighttime. Recently, new ideas have been spurred to use alternative sources, such as Far InfraRed (FIR) temperature sensor feeds for detecting pedestrians in low-light conditions. This study comprehensively reviews recent developments in low-light pedestrian detection approaches. It systematically categorizes and analyses various algorithms from region-based to non-region-based and graph-based learning methodologies by highlighting their methodologies, implementation issues, and challenges. It also outlines the key benchmark datasets that can be used for research and development of advanced pedestrian detection algorithms, particularly in low-light situations
    Variational Temporal IRT: Fast, Accurate, and Explainable Inference of Dynamic Learner Proficiency. (arXiv:2311.08594v1 [cs.LG])
    Dynamic Item Response Models extend the standard Item Response Theory (IRT) to capture temporal dynamics in learner ability. While these models have the potential to allow instructional systems to actively monitor the evolution of learner proficiency in real time, existing dynamic item response models rely on expensive inference algorithms that scale poorly to massive datasets. In this work, we propose Variational Temporal IRT (VTIRT) for fast and accurate inference of dynamic learner proficiency. VTIRT offers orders of magnitude speedup in inference runtime while still providing accurate inference. Moreover, the proposed algorithm is intrinsically interpretable by virtue of its modular design. When applied to 9 real student datasets, VTIRT consistently yields improvements in predicting future learner performance over other learner proficiency models.
    An Eye on Clinical BERT: Investigating Language Model Generalization for Diabetic Eye Disease Phenotyping. (arXiv:2311.08687v1 [cs.CL])
    Diabetic eye disease is a major cause of blindness worldwide. The ability to monitor relevant clinical trajectories and detect lapses in care is critical to managing the disease and preventing blindness. Alas, much of the information necessary to support these goals is found only in the free text of the electronic medical record. To fill this information gap, we introduce a system for extracting evidence from clinical text of 19 clinical concepts related to diabetic eye disease and inferring relevant attributes for each. In developing this ophthalmology phenotyping system, we are also afforded a unique opportunity to evaluate the effectiveness of clinical language models at adapting to new clinical domains. Across multiple training paradigms, we find that BERT language models pretrained on out-of-distribution clinical data offer no significant improvement over BERT language models pretrained on non-clinical data for our domain. Our study tempers recent claims that language models pretrained on clinical data are necessary for clinical NLP tasks and highlights the importance of not treating clinical language data as a single homogeneous domain.
    Automated Volume Corrected Mitotic Index Calculation Through Annotation-Free Deep Learning using Immunohistochemistry as Reference Standard. (arXiv:2311.08949v1 [eess.IV])
    The volume-corrected mitotic index (M/V-Index) was shown to provide prognostic value in invasive breast carcinomas. However, despite its prognostic significance, it is not established as the standard method for assessing aggressive biological behaviour, due to the high additional workload associated with determining the epithelial proportion. In this work, we show that using a deep learning pipeline solely trained with an annotation-free, immunohistochemistry-based approach, provides accurate estimations of epithelial segmentation in canine breast carcinomas. We compare our automatic framework with the manually annotated M/V-Index in a study with three board-certified pathologists. Our results indicate that the deep learning-based pipeline shows expert-level performance, while providing time efficiency and reproducibility.
    Supervised low-rank semi-nonnegative matrix factorization with frequency regularization for forecasting spatio-temporal data. (arXiv:2311.08636v1 [stat.ML])
    We propose a novel methodology for forecasting spatio-temporal data using supervised semi-nonnegative matrix factorization (SSNMF) with frequency regularization. Matrix factorization is employed to decompose spatio-temporal data into spatial and temporal components. To improve clarity in the temporal patterns, we introduce a nonnegativity constraint on the time domain along with regularization in the frequency domain. Specifically, regularization in the frequency domain involves selecting features in the frequency space, making an interpretation in the frequency domain more convenient. We propose two methods in the frequency domain: soft and hard regularizations, and provide convergence guarantees to first-order stationary points of the corresponding constrained optimization problem. While our primary motivation stems from geophysical data analysis based on GRACE (Gravity Recovery and Climate Experiment) data, our methodology has the potential for wider application. Consequently, when applying our methodology to GRACE data, we find that the results with the proposed methodology are comparable to previous research in the field of geophysical sciences but offer clearer interpretability.
    When Do Program-of-Thoughts Work for Reasoning?. (arXiv:2308.15452v4 [cs.CL] UPDATED)
    In the realm of embodied artificial intelligence, the reasoning capabilities of Large Language Models (LLMs) play a pivotal role. Although there are effective methods like program-of-thought prompting for LLMs which uses programming language to tackle complex reasoning tasks, the specific impact of code data on the improvement of reasoning capabilities remains under-explored. To address this gap, we propose complexity-impacted reasoning score (CIRS), which combines structural and logical attributes, to measure the correlation between code and reasoning abilities. Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity by considering the difficulty and the cyclomatic complexity. Through an empirical analysis, we find not all code data of complexity can be learned or understood by LLMs. Optimal level of complexity is critical to the improvement of reasoning abilities by program-aided prompting. Then we design an auto-synthesizing and stratifying algorithm, and apply it to instruction generation for mathematical reasoning and code data filtering for code generation tasks. Extensive results demonstrates the effectiveness of our proposed approach. Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
    Damped Proximal Augmented Lagrangian Method for weakly-Convex Problems with Convex Constraints. (arXiv:2311.09065v1 [math.OC])
    We give a damped proximal augmented Lagrangian method (DPALM) for solving problems with a weakly-convex objective and convex linear/nonlinear constraints. Instead of taking a full stepsize, DPALM adopts a damped dual stepsize to ensure the boundedness of dual iterates. We show that DPALM can produce a (near) $\vareps$-KKT point within $O(\vareps^{-2})$ outer iterations if each DPALM subproblem is solved to a proper accuracy. In addition, we establish overall iteration complexity of DPALM when the objective is either a regularized smooth function or in a regularized compositional form. For the former case, DPALM achieves the complexity of $\widetilde{\mathcal{O}}\left(\varepsilon^{-2.5} \right)$ to produce an $\varepsilon$-KKT point by applying an accelerated proximal gradient (APG) method to each DPALM subproblem. For the latter case, the complexity of DPALM is $\widetilde{\mathcal{O}}\left(\varepsilon^{-3} \right)$ to produce a near $\varepsilon$-KKT point by using an APG to solve a Moreau-envelope smoothed version of each subproblem. Our outer iteration complexity and the overall complexity either generalize existing best ones from unconstrained or linear-constrained problems to convex-constrained ones, or improve over the best-known results on solving the same-structured problems. Furthermore, numerical experiments on linearly/quadratically constrained non-convex quadratic programs and linear-constrained robust nonlinear least squares are conducted to demonstrate the empirical efficiency of the proposed DPALM over several state-of-the art methods.
    Limitations of neural network training due to numerical instability of backpropagation. (arXiv:2210.00805v4 [cs.LG] UPDATED)
    We study the training of deep neural networks by gradient descent where floating-point arithmetic is used to compute the gradients. In this framework and under realistic assumptions, we demonstrate that it is highly unlikely to find ReLU neural networks that maintain, in the course of training with gradient descent, superlinearly many affine pieces with respect to their number of layers. In virtually all approximation theoretical arguments that yield high-order polynomial rates of approximation, sequences of ReLU neural networks with exponentially many affine pieces compared to their numbers of layers are used. As a consequence, we conclude that approximating sequences of ReLU neural networks resulting from gradient descent in practice differ substantially from theoretically constructed sequences. The assumptions and the theoretical results are compared to a numerical study, which yields concurring results.
    Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev and Besov Spaces. (arXiv:2211.14400v4 [stat.ML] UPDATED)
    Let $\Omega = [0,1]^d$ be the unit cube in $\mathbb{R}^d$. We study the problem of how efficiently, in terms of the number of parameters, deep neural networks with the ReLU activation function can approximate functions in the Sobolev spaces $W^s(L_q(\Omega))$ and Besov spaces $B^s_r(L_q(\Omega))$, with error measured in the $L_p(\Omega)$ norm. This problem is important when studying the application of neural networks in a variety of fields, including scientific computing and signal processing, and has previously been solved only when $p=q=\infty$. Our contribution is to provide a complete solution for all $1\leq p,q\leq \infty$ and $s > 0$ for which the corresponding Sobolev or Besov space compactly embeds into $L_p$. The key technical tool is a novel bit-extraction technique which gives an optimal encoding of sparse vectors. This enables us to obtain sharp upper bounds in the non-linear regime where $p > q$. We also provide a novel method for deriving $L_p$-approximation lower bounds based upon VC-dimension when $p < \infty$. Our results show that very deep ReLU networks significantly outperform classical methods of approximation in terms of the number of parameters, but that this comes at the cost of parameters which are not encodable.
    Purpose in the Machine: Do Traffic Simulators Produce Distributionally Equivalent Outcomes for Reinforcement Learning Applications?. (arXiv:2311.08429v1 [cs.LG])
    Traffic simulators are used to generate data for learning in intelligent transportation systems (ITSs). A key question is to what extent their modelling assumptions affect the capabilities of ITSs to adapt to various scenarios when deployed in the real world. This work focuses on two simulators commonly used to train reinforcement learning (RL) agents for traffic applications, CityFlow and SUMO. A controlled virtual experiment varying driver behavior and simulation scale finds evidence against distributional equivalence in RL-relevant measures from these simulators, with the root mean squared error and KL divergence being significantly greater than 0 for all assessed measures. While granular real-world validation generally remains infeasible, these findings suggest that traffic simulators are not a deus ex machina for RL training: understanding the impacts of inter-simulator differences is necessary to train and deploy RL-based ITSs.  ( 2 min )
    Rankitect: Ranking Architecture Search Battling World-class Engineers at Meta Scale. (arXiv:2311.08430v1 [cs.LG])
    Neural Architecture Search (NAS) has demonstrated its efficacy in computer vision and potential for ranking systems. However, prior work focused on academic problems, which are evaluated at small scale under well-controlled fixed baselines. In industry system, such as ranking system in Meta, it is unclear whether NAS algorithms from the literature can outperform production baselines because of: (1) scale - Meta ranking systems serve billions of users, (2) strong baselines - the baselines are production models optimized by hundreds to thousands of world-class engineers for years since the rise of deep learning, (3) dynamic baselines - engineers may have established new and stronger baselines during NAS search, and (4) efficiency - the search pipeline must yield results quickly in alignment with the productionization life cycle. In this paper, we present Rankitect, a NAS software framework for ranking systems at Meta. Rankitect seeks to build brand new architectures by composing low level building blocks from scratch. Rankitect implements and improves state-of-the-art (SOTA) NAS methods for comprehensive and fair comparison under the same search space, including sampling-based NAS, one-shot NAS, and Differentiable NAS (DNAS). We evaluate Rankitect by comparing to multiple production ranking models at Meta. We find that Rankitect can discover new models from scratch achieving competitive tradeoff between Normalized Entropy loss and FLOPs. When utilizing search space designed by engineers, Rankitect can generate better models than engineers, achieving positive offline evaluation and online A/B test at Meta scale.
    Leveraging Foundation Models to Improve Lightweight Clients in Federated Learning. (arXiv:2311.08479v1 [cs.LG])
    Federated Learning (FL) is a distributed training paradigm that enables clients scattered across the world to cooperatively learn a global model without divulging confidential data. However, FL faces a significant challenge in the form of heterogeneous data distributions among clients, which leads to a reduction in performance and robustness. A recent approach to mitigating the impact of heterogeneous data distributions is through the use of foundation models, which offer better performance at the cost of larger computational overheads and slower inference speeds. We introduce foundation model distillation to assist in the federated training of lightweight client models and increase their performance under heterogeneous data settings while keeping inference costs low. Our results show improvement in the global model performance on a balanced testing set, which contains rarely observed samples, even under extreme non-IID client data distributions. We conduct a thorough evaluation of our framework with different foundation model backbones on CIFAR10, with varying degrees of heterogeneous data distributions ranging from class-specific data partitions across clients to dirichlet data sampling, parameterized by values between 0.01 and 1.0.
    Multistage Collaborative Knowledge Distillation from Large Language Models. (arXiv:2311.08640v1 [cs.CL])
    We study semi-supervised sequence prediction tasks where labeled data are too scarce to effectively finetune a model and at the same time few-shot prompting of a large language model (LLM) has suboptimal performance. This happens when a task, such as parsing, is expensive to annotate and also unfamiliar to a pretrained LLM. In this paper, we present a discovery that student models distilled from a prompted LLM can often generalize better than their teacher on such tasks. Leveraging this finding, we propose a new distillation method, multistage collaborative knowledge distillation from an LLM (MCKD), for such tasks. MCKD first prompts an LLM using few-shot in-context learning to produce pseudolabels for unlabeled data. Then, at each stage of distillation, a pair of students are trained on disjoint partitions of the pseudolabeled data. Each student subsequently produces new and improved pseudolabels for the unseen partition to supervise the next round of student(s) with. We show the benefit of multistage cross-partition labeling on two constituency parsing tasks. On CRAFT biomedical parsing, 3-stage MCKD with 50 labeled examples matches the performance of supervised finetuning with 500 examples and outperforms the prompted LLM and vanilla KD by 7.5% and 3.7% parsing F1, respectively.
    Interpretable by Design: Wrapper Boxes Combine Neural Performance with Faithful Explanations. (arXiv:2311.08644v1 [cs.LG])
    Can we preserve the accuracy of neural models while also providing faithful explanations? We present wrapper boxes, a general approach to generate faithful, example-based explanations for model predictions while maintaining predictive performance. After training a neural model as usual, its learned feature representation is input to a classic, interpretable model to perform the actual prediction. This simple strategy is surprisingly effective, with results largely comparable to those of the original neural model, as shown across three large pre-trained language models, two datasets of varying scale, four classic models, and four evaluation metrics. Moreover, because these classic models are interpretable by design, the subset of training examples that determine classic model predictions can be shown directly to users.
    Physical Adversarial Examples for Multi-Camera Systems. (arXiv:2311.08539v1 [cs.CV])
    Neural networks build the foundation of several intelligent systems, which, however, are known to be easily fooled by adversarial examples. Recent advances made these attacks possible even in air-gapped scenarios, where the autonomous system observes its surroundings by, e.g., a camera. We extend these ideas in our research and evaluate the robustness of multi-camera setups against such physical adversarial examples. This scenario becomes ever more important with the rise in popularity of autonomous vehicles, which fuse the information of several cameras for their driving decision. While we find that multi-camera setups provide some robustness towards past attack methods, we see that this advantage reduces when optimizing on multiple perspectives at once. We propose a novel attack method that we call Transcender-MC, where we incorporate online 3D renderings and perspective projections in the training process. Moreover, we motivate that certain data augmentation techniques can facilitate the generation of successful adversarial examples even further. Transcender-MC is 11% more effective in successfully attacking multi-camera setups than state-of-the-art methods. Our findings offer valuable insights regarding the resilience of object detection in a setup with multiple cameras and motivate the need of developing adequate defense mechanisms against them.  ( 2 min )
    Multiple-Question Multiple-Answer Text-VQA. (arXiv:2311.08622v1 [cs.CV])
    We present Multiple-Question Multiple-Answer (MQMA), a novel approach to do text-VQA in encoder-decoder transformer models. The text-VQA task requires a model to answer a question by understanding multi-modal content: text (typically from OCR) and an associated image. To the best of our knowledge, almost all previous approaches for text-VQA process a single question and its associated content to predict a single answer. In order to answer multiple questions from the same image, each question and content are fed into the model multiple times. In contrast, our proposed MQMA approach takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner at the same time. We make several novel architectural modifications to standard encoder-decoder transformers to support MQMA. We also propose a novel MQMA denoising pre-training task which is designed to teach the model to align and delineate multiple questions and content with associated answers. MQMA pre-trained model achieves state-of-the-art results on multiple text-VQA datasets, each with strong baselines. Specifically, on OCR-VQA (+2.5%), TextVQA (+1.4%), ST-VQA (+0.6%), DocVQA (+1.1%) absolute improvements over the previous state-of-the-art approaches.  ( 2 min )
    Federated Learning for Sparse Principal Component Analysis. (arXiv:2311.08677v1 [cs.LG])
    In the rapidly evolving realm of machine learning, algorithm effectiveness often faces limitations due to data quality and availability. Traditional approaches grapple with data sharing due to legal and privacy concerns. The federated learning framework addresses this challenge. Federated learning is a decentralized approach where model training occurs on client sides, preserving privacy by keeping data localized. Instead of sending raw data to a central server, only model updates are exchanged, enhancing data security. We apply this framework to Sparse Principal Component Analysis (SPCA) in this work. SPCA aims to attain sparse component loadings while maximizing data variance for improved interpretability. Beside the L1 norm regularization term in conventional SPCA, we add a smoothing function to facilitate gradient-based optimization methods. Moreover, in order to improve computational efficiency, we introduce a least squares approximation to original SPCA. This enables analytic solutions on the optimization processes, leading to substantial computational improvements. Within the federated framework, we formulate SPCA as a consensus optimization problem, which can be solved using the Alternating Direction Method of Multipliers (ADMM). Our extensive experiments involve both IID and non-IID random features across various data owners. Results on synthetic and public datasets affirm the efficacy of our federated SPCA approach.
    A Unified Approach for Comprehensive Analysis of Various Spectral and Tissue Doppler Echocardiography. (arXiv:2311.08439v1 [eess.IV])
    Doppler echocardiography offers critical insights into cardiac function and phases by quantifying blood flow velocities and evaluating myocardial motion. However, previous methods for automating Doppler analysis, ranging from initial signal processing techniques to advanced deep learning approaches, have been constrained by their reliance on electrocardiogram (ECG) data and their inability to process Doppler views collectively. We introduce a novel unified framework using a convolutional neural network for comprehensive analysis of spectral and tissue Doppler echocardiography images that combines automatic measurements and end-diastole (ED) detection into a singular method. The network automatically recognizes key features across various Doppler views, with novel Doppler shape embedding and anti-aliasing modules enhancing interpretation and ensuring consistent analysis. Empirical results indicate a consistent outperformance in performance metrics, including dice similarity coefficients (DSC) and intersection over union (IoU). The proposed framework demonstrates strong agreement with clinicians in Doppler automatic measurements and competitive performance in ED detection.  ( 2 min )
    MAP's not dead yet: Uncovering true language model modes by conditioning away degeneracy. (arXiv:2311.08817v1 [cs.CL])
    It has been widely observed that exact or approximate MAP (mode-seeking) decoding from natural language generation (NLG) models consistently leads to degenerate outputs (Stahlberg and Byrne, 2019, Holtzman et al., 2019). This has generally been attributed to either a fundamental inadequacy of modes in models or weaknesses in language modeling. Contrastingly in this work, we emphasize that degenerate modes can even occur in the absence of any model error, due to contamination of the training data. Specifically, we show that mixing even a tiny amount of low-entropy noise with a population text distribution can cause the data distribution's mode to become degenerate, implying that any models trained on it will be as well. As the unconditional mode of NLG models will often be degenerate, we therefore propose to apply MAP decoding to the model's distribution conditional on avoiding specific degeneracies. Using exact-search, we empirically verify that the length-conditional modes of machine translation models and language models are indeed more fluent and topical than their unconditional modes. For the first time, we also share many examples of exact modal sequences from these models, and from several variants of the LLaMA-7B model. Notably, the modes of the LLaMA models are still degenerate, showing that improvements in modeling have not fixed this issue. Because of the cost of exact mode finding algorithms, we develop an approximate mode finding approach, ACBS, which finds sequences that are both high-likelihood and high-quality. We apply this approach to LLaMA-7B, a model which was not trained for instruction following, and find that we are able to elicit reasonable outputs without any finetuning.
    Towards Verifiable Text Generation with Symbolic References. (arXiv:2311.09188v1 [cs.CL])
    Large language models (LLMs) have demonstrated an impressive ability to synthesize plausible and fluent text. However they remain vulnerable to hallucinations, and thus their outputs generally require manual human verification for high-stakes applications, which can be time-consuming and difficult. This paper proposes symbolically grounded generation (SymGen) as a simple approach for enabling easier validation of an LLM's output. SymGen prompts an LLM to interleave its regular output text with explicit symbolic references to fields present in some conditioning data (e.g., a table in JSON format). The references can be used to display the provenance of different spans of text in the generation, reducing the effort required for manual verification. Across data-to-text and question answering experiments, we find that LLMs are able to directly output text that makes use of symbolic references while maintaining fluency and accuracy.  ( 2 min )
    MADG: Margin-based Adversarial Learning for Domain Generalization. (arXiv:2311.08503v1 [cs.CV])
    Domain Generalization (DG) techniques have emerged as a popular approach to address the challenges of domain shift in Deep Learning (DL), with the goal of generalizing well to the target domain unseen during the training. In recent years, numerous methods have been proposed to address the DG setting, among which one popular approach is the adversarial learning-based methodology. The main idea behind adversarial DG methods is to learn domain-invariant features by minimizing a discrepancy metric. However, most adversarial DG methods use 0-1 loss based $\mathcal{H}\Delta\mathcal{H}$ divergence metric. In contrast, the margin loss-based discrepancy metric has the following advantages: more informative, tighter, practical, and efficiently optimizable. To mitigate this gap, this work proposes a novel adversarial learning DG algorithm, MADG, motivated by a margin loss-based discrepancy metric. The proposed MADG model learns domain-invariant features across all source domains and uses adversarial training to generalize well to the unseen target domain. We also provide a theoretical analysis of the proposed MADG model based on the unseen target error bound. Specifically, we construct the link between the source and unseen domains in the real-valued hypothesis space and derive the generalization bound using margin loss and Rademacher complexity. We extensively experiment with the MADG model on popular real-world DG datasets, VLCS, PACS, OfficeHome, DomainNet, and TerraIncognita. We evaluate the proposed algorithm on DomainBed's benchmark and observe consistent performance across all the datasets.
    Fixed-Budget Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit. (arXiv:2310.15681v2 [cs.LG] UPDATED)
    We study the real-valued combinatorial pure exploration of the multi-armed bandit in the fixed-budget setting. We first introduce the Combinatorial Successive Asign (CSA) algorithm, which is the first algorithm that can identify the best action even when the size of the action class is exponentially large with respect to the number of arms. We show that the upper bound of the probability of error of the CSA algorithm matches a lower bound up to a logarithmic factor in the exponent. Then, we introduce another algorithm named the Minimax Combinatorial Successive Accepts and Rejects (Minimax-CombSAR) algorithm for the case where the size of the action class is polynomial, and show that it is optimal, which matches a lower bound. Finally, we experimentally compare the algorithms with previous methods and show that our algorithm performs better.
    Toucan: Token-Aware Character Level Language Modeling. (arXiv:2311.08620v1 [cs.CL])
    Character-level language models obviate the need for separately trained tokenizers, but efficiency suffers from longer sequence lengths. Learning to combine character representations into tokens has made training these models more efficient, but they still require decoding characters individually. We propose Toucan, an augmentation to character-level models to make them "token-aware". Comparing our method to prior work, we demonstrate significant speed-ups in character generation without a loss in language modeling performance. We then explore differences between our learned dynamic tokenization of character sequences with popular fixed vocabulary solutions such as Byte-Pair Encoding and WordPiece, finding our approach leads to a greater amount of longer sequences tokenized as single items. Our project and code are available at https://nlp.jhu.edu/nuggets/.
    Schema-adaptable Knowledge Graph Construction. (arXiv:2305.08703v4 [cs.CL] UPDATED)
    Conventional Knowledge Graph Construction (KGC) approaches typically follow the static information extraction paradigm with a closed set of pre-defined schema. As a result, such approaches fall short when applied to dynamic scenarios or domains, whereas a new type of knowledge emerges. This necessitates a system that can handle evolving schema automatically to extract information for KGC. To address this need, we propose a new task called schema-adaptable KGC, which aims to continually extract entity, relation, and event based on a dynamically changing schema graph without re-training. We first split and convert existing datasets based on three principles to build a benchmark, i.e., horizontal schema expansion, vertical schema expansion, and hybrid schema expansion; then investigate the schema-adaptable performance of several well-known approaches such as Text2Event, TANL, UIE and GPT-3.5. We further propose a simple yet effective baseline dubbed \textsc{AdaKGC}, which contains schema-enriched prefix instructor and schema-conditioned dynamic decoding to better handle evolving schema. Comprehensive experimental results illustrate that AdaKGC can outperform baselines but still have room for improvement. We hope the proposed work can deliver benefits to the community. Code and datasets available at https://github.com/zjunlp/AdaKGC.
    Manifold learning in Wasserstein space. (arXiv:2311.08549v1 [stat.ML])
    This paper aims at building the theoretical foundations for manifold learning algorithms in the space of absolutely continuous probability measures on a compact and convex subset of $\mathbb{R}^d$, metrized with the Wasserstein-2 distance $W$. We begin by introducing a natural construction of submanifolds $\Lambda$ of probability measures equipped with metric $W_\Lambda$, the geodesic restriction of $W$ to $\Lambda$. In contrast to other constructions, these submanifolds are not necessarily flat, but still allow for local linearizations in a similar fashion to Riemannian submanifolds of $\mathbb{R}^d$. We then show how the latent manifold structure of $(\Lambda,W_{\Lambda})$ can be learned from samples $\{\lambda_i\}_{i=1}^N$ of $\Lambda$ and pairwise extrinsic Wasserstein distances $W$ only. In particular, we show that the metric space $(\Lambda,W_{\Lambda})$ can be asymptotically recovered in the sense of Gromov--Wasserstein from a graph with nodes $\{\lambda_i\}_{i=1}^N$ and edge weights $W(\lambda_i,\lambda_j)$. In addition, we demonstrate how the tangent space at a sample $\lambda$ can be asymptotically recovered via spectral analysis of a suitable "covariance operator" using optimal transport maps from $\lambda$ to sufficiently close and diverse samples $\{\lambda_i\}_{i=1}^N$. The paper closes with some explicit constructions of submanifolds $\Lambda$ and numerical examples on the recovery of tangent spaces through spectral analysis.  ( 2 min )
    Surrogate Modeling for Computationally Expensive Simulations of Supernovae in High-Resolution Galaxy Simulations. (arXiv:2311.08460v1 [astro-ph.GA])
    Some stars are known to explode at the end of their lives, called supernovae (SNe). The substantial amount of matter and energy that SNe release provides significant feedback to star formation and gas dynamics in a galaxy. SNe release a substantial amount of matter and energy to the interstellar medium, resulting in significant feedback to star formation and gas dynamics in a galaxy. While such feedback has a crucial role in galaxy formation and evolution, in simulations of galaxy formation, it has only been implemented using simple {\it sub-grid models} instead of numerically solving the evolution of gas elements around SNe in detail due to a lack of resolution. We develop a method combining machine learning and Gibbs sampling to predict how a supernova (SN) affects the surrounding gas. The fidelity of our model in the thermal energy and momentum distribution outperforms the low-resolution SN simulations. Our method can replace the SN sub-grid models and help properly simulate un-resolved SN feedback in galaxy formation simulations. We find that employing our new approach reduces the necessary computational cost to $\sim$ 1 percent compared to directly resolving SN feedback.  ( 2 min )
    Parameter-Efficient Multilingual Summarisation: An Empirical Study. (arXiv:2311.08572v1 [cs.CL])
    With the increasing prevalence of Large Language Models, traditional full fine-tuning approaches face growing challenges, especially in memory-intensive tasks. This paper investigates the potential of Parameter-Efficient Fine-Tuning, focusing on Low-Rank Adaptation (LoRA), for complex and under-explored multilingual summarisation tasks. We conduct an extensive study across different data availability scenarios, including full-data, low-data, and cross-lingual transfer, leveraging models of different sizes. Our findings reveal that LoRA lags behind full fine-tuning when trained with full data, however, it excels in low-data scenarios and cross-lingual transfer. Interestingly, as models scale up, the performance gap between LoRA and full fine-tuning diminishes. Additionally, we investigate effective strategies for few-shot cross-lingual transfer, finding that continued LoRA tuning achieves the best performance compared to both full fine-tuning and dynamic composition of language-specific LoRA modules.
    sQUlearn $\unicode{x2013}$ A Python Library for Quantum Machine Learning. (arXiv:2311.08990v1 [quant-ph])
    sQUlearn introduces a user-friendly, NISQ-ready Python library for quantum machine learning (QML), designed for seamless integration with classical machine learning tools like scikit-learn. The library's dual-layer architecture serves both QML researchers and practitioners, enabling efficient prototyping, experimentation, and pipelining. sQUlearn provides a comprehensive toolset that includes both quantum kernel methods and quantum neural networks, along with features like customizable data encoding strategies, automated execution handling, and specialized kernel regularization techniques. By focusing on NISQ-compatibility and end-to-end automation, sQUlearn aims to bridge the gap between current quantum computing capabilities and practical machine learning applications.
    LocaliseBot: Multi-view 3D object localisation with differentiable rendering for robot grasping. (arXiv:2311.08438v1 [cs.CV])
    Robot grasp typically follows five stages: object detection, object localisation, object pose estimation, grasp pose estimation, and grasp planning. We focus on object pose estimation. Our approach relies on three pieces of information: multiple views of the object, the camera's extrinsic parameters at those viewpoints, and 3D CAD models of objects. The first step involves a standard deep learning backbone (FCN ResNet) to estimate the object label, semantic segmentation, and a coarse estimate of the object pose with respect to the camera. Our novelty is using a refinement module that starts from the coarse pose estimate and refines it by optimisation through differentiable rendering. This is a purely vision-based approach that avoids the need for other information such as point cloud or depth images. We evaluate our object pose estimation approach on the ShapeNet dataset and show improvements over the state of the art. We also show that the estimated object pose results in 99.65% grasp accuracy with the ground truth grasp candidates on the Object Clutter Indoor Dataset (OCID) Grasp dataset, as computed using standard practice.  ( 2 min )
    Fast Sparse 3D Convolution Network with VDB. (arXiv:2311.02762v2 [cs.CV] UPDATED)
    We proposed a new Convolution Neural Network implementation optimized for sparse 3D data inference. This implementation uses NanoVDB as the data structure to store the sparse tensor. It leaves a relatively small memory footprint while maintaining high performance. We demonstrate that this architecture is around 20 times faster than the state-of-the-art dense CNN model on a high-resolution 3D object classification network.
    Probability of Collision of satellites and space debris for short-term encounters: Rederivation and fast-to-compute upper and lower bounds. (arXiv:2311.08978v1 [astro-ph.EP])
    The proliferation of space debris in LEO has become a major concern for the space industry. With the growing interest in space exploration, the prediction of potential collisions between objects in orbit has become a crucial issue. It is estimated that, in orbit, there are millions of fragments a few millimeters in size and thousands of inoperative satellites and discarded rocket stages. Given the high speeds that these fragments can reach, even fragments a few millimeters in size can cause fractures in a satellite's hull or put a serious crack in the window of a space shuttle. The conventional method proposed by Akella and Alfriend in 2000 remains widely used to estimate the probability of collision in short-term encounters. Given the small period of time, it is assumed that, during the encounter: (1) trajectories are represented by straight lines with constant velocity; (2) there is no velocity uncertainty and the position exhibits a stationary distribution throughout the encounter; and (3) position uncertainties are independent and represented by Gaussian distributions. This study introduces a novel derivation based on first principles that naturally allows for tight and fast upper and lower bounds for the probability of collision. We tested implementations of both probability and bound computations with the original and our formulation on a real CDM dataset used in ESA's Collision Avoidance Challenge. Our approach reduces the calculation of the probability to two one-dimensional integrals and has the potential to significantly reduce the processing time compared to the traditional method, from 80% to nearly real-time.
    HEALNet -- Hybrid Multi-Modal Fusion for Heterogeneous Biomedical Data. (arXiv:2311.09115v1 [cs.LG])
    Technological advances in medical data collection such as high-resolution histopathology and high-throughput genomic sequencing have contributed to the rising requirement for multi-modal biomedical modelling, specifically for image, tabular, and graph data. Most multi-modal deep learning approaches use modality-specific architectures that are trained separately and cannot capture the crucial cross-modal information that motivates the integration of different data sources. This paper presents the Hybrid Early-fusion Attention Learning Network (HEALNet): a flexible multi-modal fusion architecture, which a) preserves modality-specific structural information, b) captures the cross-modal interactions and structural information in a shared latent space, c) can effectively handle missing modalities during training and inference, and d) enables intuitive model inspection by learning on the raw data input instead of opaque embeddings. We conduct multi-modal survival analysis on Whole Slide Images and Multi-omic data on four cancer cohorts of The Cancer Genome Atlas (TCGA). HEALNet achieves state-of-the-art performance, substantially improving over both uni-modal and recent multi-modal baselines, whilst being robust in scenarios with missing modalities.
    On the Need and Applicability of Causality for Fair Machine Learning. (arXiv:2207.04053v3 [cs.LG] UPDATED)
    Besides its common use cases in epidemiology, political, and social sciences, causality turns out to be crucial in evaluating the fairness of automated decisions, both in a legal and everyday sense. We provide arguments and examples, of why causality is particularly important for fairness evaluation. In particular, we point out the social impact of non-causal predictions and the legal anti-discrimination process that relies on causal claims. We conclude with a discussion about the challenges and limitations of applying causality in practical scenarios as well as possible solutions.
    $\beta$-Variational autoencoders and transformers for reduced-order modelling of fluid flows. (arXiv:2304.03571v2 [physics.flu-dyn] UPDATED)
    Variational autoencoder (VAE) architectures have the potential to develop reduced-order models (ROMs) for chaotic fluid flows. We propose a method for learning compact and near-orthogonal ROMs using a combination of a $\beta$-VAE and a transformer, tested on numerical data from a two-dimensional viscous flow in both periodic and chaotic regimes. The $\beta$-VAE is trained to learn a compact latent representation of the flow velocity, and the transformer is trained to predict the temporal dynamics in latent space. Using the $\beta$-VAE to learn disentangled representations in latent-space, we obtain a more interpretable flow model with features that resemble those observed in the proper orthogonal decomposition, but with a more efficient representation. Using Poincar\'e maps, the results show that our method can capture the underlying dynamics of the flow outperforming other prediction models. The proposed method has potential applications in other fields such as weather forecasting, structural dynamics or biomedical engineering.
    Ever: Mitigating Hallucination in Large Language Models through Real-Time Verification and Rectification. (arXiv:2311.09114v1 [cs.CL])
    Large Language Models (LLMs) have demonstrated remarkable proficiency in generating fluent text. However, they often encounter the challenge of generating inaccurate or hallucinated content. This issue is common in both non-retrieval-based generation and retrieval-augmented generation approaches, and existing post-hoc rectification methods may not address the accumulated hallucination errors that may be caused by the "snowballing" issue, especially in reasoning tasks. To tackle these challenges, we introduce a novel approach called Real-time Verification and Rectification (Ever). Instead of waiting until the end of the generation process to rectify hallucinations, Ever employs a real-time, step-wise generation and hallucination rectification strategy. The primary objective is to detect and rectify hallucinations as they occur during the text generation process. When compared to both retrieval-based and non-retrieval-based baselines, Ever demonstrates a significant improvement in generating trustworthy and factually accurate text across a diverse range of tasks, including short-form QA, biography generation, and multi-hop reasoning.
    Supported Trust Region Optimization for Offline Reinforcement Learning. (arXiv:2311.08935v1 [cs.LG])
    Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most policy constraint methods regularize the density of the trained policy towards the behavior policy, which is too restrictive in most cases. We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy, enjoying the less restrictive support constraint. We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Further with both errors incorporated, STR still guarantees safe policy improvement for each step. Empirical results validate the theory of STR and demonstrate its state-of-the-art performance on MuJoCo locomotion domains and much more challenging AntMaze domains.
    Efficiently Escaping Saddle Points for Non-Convex Policy Optimization. (arXiv:2311.08914v1 [cs.LG])
    Policy gradient (PG) is widely used in reinforcement learning due to its scalability and good performance. In recent years, several variance-reduced PG methods have been proposed with a theoretical guarantee of converging to an approximate first-order stationary point (FOSP) with the sample complexity of $O(\epsilon^{-3})$. However, FOSPs could be bad local optima or saddle points. Moreover, these algorithms often use importance sampling (IS) weights which could impair the statistical effectiveness of variance reduction. In this paper, we propose a variance-reduced second-order method that uses second-order information in the form of Hessian vector products (HVP) and converges to an approximate second-order stationary point (SOSP) with sample complexity of $\tilde{O}(\epsilon^{-3})$. This rate improves the best-known sample complexity for achieving approximate SOSPs by a factor of $O(\epsilon^{-0.5})$. Moreover, the proposed variance reduction technique bypasses IS weights by using HVP terms. Our experimental results show that the proposed algorithm outperforms the state of the art and is more robust to changes in random seeds.
    Editing Language Model-based Knowledge Graph Embeddings. (arXiv:2301.10405v6 [cs.CL] UPDATED)
    Recently decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, making them difficult to modify post-deployment without re-training after deployment. To address this issue, we propose a new task of editing language model-based KG embeddings in this paper. This task is designed to facilitate rapid, data-efficient updates to KG embeddings without compromising the performance of other aspects. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines demonstrating the limited ability of previous models to handle the proposed challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of the hyper network to edit/add facts. Our comprehensive experimental results reveal that KGEditor excels in updating specific facts without impacting the overall performance, even when faced with limited training resources. Code and datasets are available in https://github.com/zjunlp/PromptKG/tree/main/deltaKG.
    Infrared Image Super-Resolution: Systematic Review, and Future Trends. (arXiv:2212.12322v2 [eess.IV] UPDATED)
    Image Super-Resolution (SR) is essential for a wide range of computer vision and image processing tasks. Investigating infrared (IR) image (or thermal images) super-resolution is a continuing concern within the development of deep learning. This survey aims to provide a comprehensive perspective of IR image super-resolution, including its applications, hardware imaging system dilemmas, and taxonomy of image processing methodologies. In addition, the datasets and evaluation metrics in IR image super-resolution tasks are also discussed. Furthermore, the deficiencies in current technologies and possible promising directions for the community to explore are highlighted. To cope with the rapid development in this field, we intend to regularly update the relevant excellent work at \url{https://github.com/yongsongH/Infrared_Image_SR_Survey
    One-Shot Federated Learning with Classifier-Guided Diffusion Models. (arXiv:2311.08870v1 [cs.CV])
    One-shot federated learning (OSFL) has gained attention in recent years due to its low communication cost. However, most of the existing methods require auxiliary datasets or training generators, which hinders their practicality in real-world scenarios. In this paper, we explore the novel opportunities that diffusion models bring to OSFL and propose FedCADO, utilizing guidance from client classifiers to generate data that complies with clients' distributions and subsequently training the aggregated model on the server. Specifically, our method involves targeted optimizations in two aspects. On one hand, we conditionally edit the randomly sampled initial noises, embedding them with specified semantics and distributions, resulting in a significant improvement in both the quality and stability of generation. On the other hand, we employ the BN statistics from the classifiers to provide detailed guidance during generation. These tailored optimizations enable us to limitlessly generate datasets, which closely resemble the distribution and quality of the original client dataset. Our method effectively handles the heterogeneous client models and the problems of non-IID features or labels. In terms of privacy protection, our method avoids training any generator or transferring any auxiliary information on clients, eliminating any additional privacy leakage risks. Leveraging the extensive knowledge stored in the pre-trained diffusion model, the synthetic datasets can assist us in surpassing the knowledge limitations of the client samples, resulting in aggregation models that even outperform the performance ceiling of centralized training in some cases, which is convincingly demonstrated in the sufficient quantification and visualization experiments conducted on three large-scale multi-domain image datasets.
    An Analysis of Model-Based Reinforcement Learning From Abstracted Observations. (arXiv:2208.14407v3 [cs.LG] UPDATED)
    Many methods for Model-based Reinforcement learning (MBRL) in Markov decision processes (MDPs) provide guarantees for both the accuracy of the model they can deliver and the learning efficiency. At the same time, state abstraction techniques allow for a reduction of the size of an MDP while maintaining a bounded loss with respect to the original problem. Therefore, it may come as a surprise that no such guarantees are available when combining both techniques, i.e., where MBRL merely observes abstract states. Our theoretical analysis shows that abstraction can introduce a dependence between samples collected online (e.g., in the real world). That means that, without taking this dependence into account, results for MBRL do not directly extend to this setting. Our result shows that we can use concentration inequalities for martingales to overcome this problem. This result makes it possible to extend the guarantees of existing MBRL algorithms to the setting with abstraction. We illustrate this by combining R-MAX, a prototypical MBRL algorithm, with abstraction, thus producing the first performance guarantees for model-based 'RL from Abstracted Observations': model-based reinforcement learning with an abstract model.
    CNTLS: A Benchmark Dataset for Abstractive or Extractive Chinese Timeline Summarization. (arXiv:2105.14201v2 [cs.AI] UPDATED)
    Timeline summarization (TLS) involves creating summaries of long-running events using dated summaries from numerous news articles. However, limited data availability has significantly slowed down the development of timeline summarization. In this paper, we introduce the CNTLS dataset, a versatile resource for Chinese timeline summarization. CNTLS encompasses 77 real-life topics, each with 2524 documents and summarizes nearly 60\% days duration compression on average all topics. We meticulously analyze the corpus using well-known metrics, focusing on the style of the summaries and the complexity of the summarization task. Specifically, we evaluate the performance of various extractive and generative summarization systems on the CNTLS corpus to provide benchmarks and support further research. To the best of our knowledge, CNTLS is the first Chinese timeline summarization dataset. The dataset and source code are released\footnote{Code and data available at: \emph{\url{https://github.com/OpenSUM/CNTLS}}.}.
    Unsupervised segmentation of irradiation$\unicode{x2010}$induced order$\unicode{x2010}$disorder phase transitions in electron microscopy. (arXiv:2311.08585v1 [cond-mat.mtrl-sci])
    We present a method for the unsupervised segmentation of electron microscopy images, which are powerful descriptors of materials and chemical systems. Images are oversegmented into overlapping chips, and similarity graphs are generated from embeddings extracted from a domain$\unicode{x2010}$pretrained convolutional neural network (CNN). The Louvain method for community detection is then applied to perform segmentation. The graph representation provides an intuitive way of presenting the relationship between chips and communities. We demonstrate our method to track irradiation$\unicode{x2010}$induced amorphous fronts in thin films used for catalysis and electronics. This method has potential for "on$\unicode{x2010}$the$\unicode{x2010}$fly" segmentation to guide emerging automated electron microscopes.
    Confident Naturalness Explanation (CNE): A Framework to Explain and Assess Patterns Forming Naturalness. (arXiv:2311.08936v1 [cs.LG])
    Protected natural areas are regions that have been minimally affected by human activities such as urbanization, agriculture, and other human interventions. To better understand and map the naturalness of these areas, machine learning models can be used to analyze satellite imagery. Specifically, explainable machine learning methods show promise in uncovering patterns that contribute to the concept of naturalness within these protected environments. Additionally, addressing the uncertainty inherent in machine learning models is crucial for a comprehensive understanding of this concept. However, existing approaches have limitations. They either fail to provide explanations that are both valid and objective or struggle to offer a quantitative metric that accurately measures the contribution of specific patterns to naturalness, along with the associated confidence. In this paper, we propose a novel framework called the Confident Naturalness Explanation (CNE) framework. This framework combines explainable machine learning and uncertainty quantification to assess and explain naturalness. We introduce a new quantitative metric that describes the confident contribution of patterns to the concept of naturalness. Furthermore, we generate an uncertainty-aware segmentation mask for each input sample, highlighting areas where the model lacks knowledge. To demonstrate the effectiveness of our framework, we apply it to a study site in Fennoscandia using two open-source satellite datasets.
    Cross-dataset domain adaptation for the classification COVID-19 using chest computed tomography images. (arXiv:2311.08524v1 [eess.IV])
    Detecting COVID-19 patients using Computed Tomography (CT) images of the lungs is an active area of research. Datasets of CT images from COVID-19 patients are becoming available. Deep learning (DL) solutions and in particular Convolutional Neural Networks (CNN) have achieved impressive results for the classification of COVID-19 CT images, but only when the training and testing take place within the same dataset. Work on the cross-dataset problem is still limited and the achieved results are low. Our work tackles the cross-dataset problem through a Domain Adaptation (DA) technique with deep learning. Our proposed solution, COVID19-DANet, is based on pre-trained CNN backbone for feature extraction. For this task, we select the pre-trained Efficientnet-B3 CNN because it has achieved impressive classification accuracy in previous work. The backbone CNN is followed by a prototypical layer which is a concept borrowed from prototypical networks in few-shot learning (FSL). It computes a cosine distance between given samples and the class prototypes and then converts them to class probabilities using the Softmax function. To train the COVID19-DANet model, we propose a combined loss function that is composed of the standard cross-entropy loss for class discrimination and another entropy loss computed over the unlabelled target set only. This so-called unlabelled target entropy loss is minimized and maximized in an alternative fashion, to reach the two objectives of class discrimination and domain invariance. COVID19-DANet is tested under four cross-dataset scenarios using the SARS-CoV-2-CT and COVID19-CT datasets and has achieved encouraging results compared to recent work in the literature.
    A Multimodal Dataset of 21,412 Recorded Nights for Sleep and Respiratory Research. (arXiv:2311.08979v1 [cs.LG])
    This study introduces a novel, rich dataset obtained from home sleep apnea tests using the FDA-approved WatchPAT-300 device, collected from 7,077 participants over 21,412 nights. The dataset comprises three levels of sleep data: raw multi-channel time-series from sensors, annotated sleep events, and computed summary statistics, which include 447 features related to sleep architecture, sleep apnea, and heart rate variability (HRV). We present reference values for Apnea/Hypopnea Index (AHI), sleep efficiency, Wake After Sleep Onset (WASO), and HRV sample entropy, stratified by age and sex. Moreover, we demonstrate that the dataset improves the predictive capability for various health related traits, including body composition, bone density, blood sugar levels and cardiovascular health. These results illustrate the dataset's potential to advance sleep research, personalized healthcare, and machine learning applications in biomedicine.
    Attribute Diversity Determines the Systematicity Gap in VQA. (arXiv:2311.08695v1 [cs.LG])
    The degree to which neural networks can generalize to new combinations of familiar concepts, and the conditions under which they are able to do so, has long been an open question. In this work, we study the systematicity gap in visual question answering: the performance difference between reasoning on previously seen and unseen combinations of object attributes. To test, we introduce a novel diagnostic dataset, CLEVR-HOPE. We find that while increased quantity of training data does not reduce the systematicity gap, increased training data diversity of the attributes in the unseen combination does. In all, our experiments suggest that the more distinct attribute type combinations are seen during training, the more systematic we can expect the resulting model to be.
    Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation. (arXiv:2311.08877v1 [cs.CL])
    To maintain user trust, large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user. The standard approach of estimating confidence is to use the softmax probabilities of these models, but as of November 2023, state-of-the-art LLMs such as GPT-4 and Claude-v1.3 do not provide access to these probabilities. We first study eliciting confidence linguistically -- asking an LLM for its confidence in its answer -- which performs reasonably (80.5% AUC on GPT-4 averaged across 12 question-answering datasets -- 7% above a random baseline) but leaves room for improvement. We then explore using a surrogate confidence model -- using a model where we do have probabilities to evaluate the original model's confidence in a given question. Surprisingly, even though these probabilities come from a different and often weaker model, this method leads to higher AUC than linguistic confidences on 9 out of 12 datasets. Our best method composing linguistic confidences and surrogate model probabilities gives state-of-the-art confidence estimates on all 12 datasets (84.6% average AUC on GPT-4).
    Data Similarity is Not Enough to Explain Language Model Performance. (arXiv:2311.09006v1 [cs.CL])
    Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model's pretraining data is assumed to be easier for that model. We test whether distributional and example-specific similarity measures (embedding-, token- and model-based) correlate with language model performance through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks. Similarity correlates with performance for multilingual datasets, but in other benchmarks, we surprisingly find that similarity metrics are not correlated with accuracy or even each other. This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.
    Future-Dependent Value-Based Off-Policy Evaluation in POMDPs. (arXiv:2207.13081v2 [cs.LG] UPDATED)
    We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play similar roles as classical value functions in fully-observable MDPs. We derive a new Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We obtain the PAC result, which implies our OPE estimator is consistent as long as futures and histories contain sufficient information about latent states, and the Bellman completeness. Finally, we extend our methods to learning of dynamics and establish the connection between our approach and the well-known spectral learning methods in POMDPs.
    Assessing the Robustness of Intelligence-Driven Reinforcement Learning. (arXiv:2311.09027v1 [cs.LG])
    Robustness to noise is of utmost importance in reinforcement learning systems, particularly in military contexts where high stakes and uncertain environments prevail. Noise and uncertainty are inherent features of military operations, arising from factors such as incomplete information, adversarial actions, or unpredictable battlefield conditions. In RL, noise can critically impact decision-making, mission success, and the safety of personnel. Reward machines offer a powerful tool to express complex reward structures in RL tasks, enabling the design of tailored reinforcement signals that align with mission objectives. This paper considers the problem of the robustness of intelligence-driven reinforcement learning based on reward machines. The preliminary results presented suggest the need for further research in evidential reasoning and learning to harden current state-of-the-art reinforcement learning approaches before being mission-critical-ready.
    Low-Frequency Load Identification using CNN-BiLSTM Attention Mechanism. (arXiv:2311.08536v1 [eess.SY])
    Non-intrusive Load Monitoring (NILM) is an established technique for effective and cost-efficient electricity consumption management. The method is used to estimate appliance-level power consumption from aggregated power measurements. This paper presents a hybrid learning approach, consisting of a convolutional neural network (CNN) and a bidirectional long short-term memory (BILSTM), featuring an integrated attention mechanism, all within the context of disaggregating low-frequency power data. While prior research has been mainly focused on high-frequency data disaggregation, our study takes a distinct direction by concentrating on low-frequency data. The proposed hybrid CNN-BILSTM model is adept at extracting both temporal (time-related) and spatial (location-related) features, allowing it to precisely identify energy consumption patterns at the appliance level. This accuracy is further enhanced by the attention mechanism, which aids the model in pinpointing crucial parts of the data for more precise event detection and load disaggregation. We conduct simulations using the existing low-frequency REDD dataset to assess our model performance. The results demonstrate that our proposed approach outperforms existing methods in terms of accuracy and computation time.
    2D-RC: Two-Dimensional Neural Network Approach for OTFS Symbol Detection. (arXiv:2311.08543v1 [eess.SP])
    Orthogonal time frequency space (OTFS) is a promising modulation scheme for wireless communication in high-mobility scenarios. Recently, a reservoir computing (RC) based approach has been introduced for online subframe-based symbol detection in the OTFS system, where only a limited number of over-the-air (OTA) pilot symbols are utilized for training. However, this approach does not leverage the domain knowledge specific to the OTFS system. This paper introduces a novel two-dimensional RC (2D-RC) method that incorporates the structural knowledge of the OTFS system into the design for online symbol detection on a subframe basis. Specifically, as the channel response acts as a two-dimensional (2D) operation over the transmitted information symbols in the delay-Doppler (DD) domain, the 2D-RC is designed to have a 2D structure to equalize the channel. With the introduced architecture, the 2D-RC can benefit from the predictable channel representation in the DD domain. Moreover, unlike the previous work that requires multiple RCs to learn the channel feature, the 2D-RC only requires a single neural network for detection. Experimental results demonstrate the effectiveness of the 2D-RC approach across different OTFS system variants and modulation orders.  ( 2 min )
    Enabling CMF Estimation in Data-Constrained Scenarios: A Semantic-Encoding Knowledge Mining Model. (arXiv:2311.08690v1 [cs.LG])
    Precise estimation of Crash Modification Factors (CMFs) is central to evaluating the effectiveness of various road safety treatments and prioritizing infrastructure investment accordingly. While customized study for each countermeasure scenario is desired, the conventional CMF estimation approaches rely heavily on the availability of crash data at given sites. This not only makes the estimation costly, but the results are also less transferable, since the intrinsic similarities between different safety countermeasure scenarios are not fully explored. Aiming to fill this gap, this study introduces a novel knowledge-mining framework for CMF prediction. This framework delves into the connections of existing countermeasures and reduces the reliance of CMF estimation on crash data availability and manual data collection. Specifically, it draws inspiration from human comprehension processes and introduces advanced Natural Language Processing (NLP) techniques to extract intricate variations and patterns from existing CMF knowledge. It effectively encodes unstructured countermeasure scenarios into machine-readable representations and models the complex relationships between scenarios and CMF values. This new data-driven framework provides a cost-effective and adaptable solution that complements the case-specific approaches for CMF estimation, which is particularly beneficial when availability of crash data or time imposes constraints. Experimental validation using real-world CMF Clearinghouse data demonstrates the effectiveness of this new approach, which shows significant accuracy improvements compared to baseline methods. This approach provides insights into new possibilities of harnessing accumulated transportation knowledge in various applications.  ( 3 min )
    ConeQuest: A Benchmark for Cone Segmentation on Mars. (arXiv:2311.08657v1 [cs.CV])
    Over the years, space scientists have collected terabytes of Mars data from satellites and rovers. One important set of features identified in Mars orbital images is pitted cones, which are interpreted to be mud volcanoes believed to form in regions that were once saturated in water (i.e., a lake or ocean). Identifying pitted cones globally on Mars would be of great importance, but expert geologists are unable to sort through the massive orbital image archives to identify all examples. However, this task is well suited for computer vision. Although several computer vision datasets exist for various Mars-related tasks, there is currently no open-source dataset available for cone detection/segmentation. Furthermore, previous studies trained models using data from a single region, which limits their applicability for global detection and mapping. Motivated by this, we introduce ConeQuest, the first expert-annotated public dataset to identify cones on Mars. ConeQuest consists of >13k samples from 3 different regions of Mars. We propose two benchmark tasks using ConeQuest: (i) Spatial Generalization and (ii) Cone-size Generalization. We finetune and evaluate widely-used segmentation models on both benchmark tasks. Results indicate that cone segmentation is a challenging open problem not solved by existing segmentation models, which achieve an average IoU of 52.52% and 42.55% on in-distribution data for tasks (i) and (ii), respectively. We believe this new benchmark dataset will facilitate the development of more accurate and robust models for cone segmentation. Data and code are available at https://github.com/kerner-lab/ConeQuest.  ( 2 min )
    Variational Quantum Eigensolver with Constraints (VQEC): Solving Constrained Optimization Problems via VQE. (arXiv:2311.08502v1 [quant-ph])
    Variational quantum approaches have shown great promise in finding near-optimal solutions to computationally challenging tasks. Nonetheless, enforcing constraints in a disciplined fashion has been largely unexplored. To address this gap, this work proposes a hybrid quantum-classical algorithmic paradigm termed VQEC that extends the celebrated VQE to handle optimization with constraints. As with the standard VQE, the vector of optimization variables is captured by the state of a variational quantum circuit (VQC). To deal with constraints, VQEC optimizes a Lagrangian function classically over both the VQC parameters as well as the dual variables associated with constraints. To comply with the quantum setup, variables are updated via a perturbed primal-dual method leveraging the parameter shift rule. Among a wide gamut of potential applications, we showcase how VQEC can approximately solve quadratically-constrained binary optimization (QCBO) problems, find stochastic binary policies satisfying quadratic constraints on the average and in probability, and solve large-scale linear programs (LP) over the probability simplex. Under an assumption on the error for the VQC to approximate an arbitrary probability mass function (PMF), we provide bounds on the optimality gap attained by a VQC. Numerical tests on a quantum simulator investigate the effect of various parameters and corroborate that VQEC can generate high-quality solutions.  ( 2 min )
    Deep Neural Network Identification of Limnonectes Species and New Class Detection Using Image Data. (arXiv:2311.08661v1 [stat.ML])
    As is true of many complex tasks, the work of discovering, describing, and understanding the diversity of life on Earth (viz., biological systematics and taxonomy) requires many tools. Some of this work can be accomplished as it has been done in the past, but some aspects present us with challenges which traditional knowledge and tools cannot adequately resolve. One such challenge is presented by species complexes in which the morphological similarities among the group members make it difficult to reliably identify known species and detect new ones. We address this challenge by developing new tools using the principles of machine learning to resolve two specific questions related to species complexes. The first question is formulated as a classification problem in statistics and machine learning and the second question is an out-of-distribution (OOD) detection problem. We apply these tools to a species complex comprising Southeast Asian stream frogs (Limnonectes kuhlii complex) and employ a morphological character (hind limb skin texture) traditionally treated qualitatively in a quantitative and objective manner. We demonstrate that deep neural networks can successfully automate the classification of an image into a known species group for which it has been trained. We further demonstrate that the algorithm can successfully classify an image into a new class if the image does not belong to the existing classes. Additionally, we use the larger MNIST dataset to test the performance of our OOD detection algorithm. We finish our paper with some concluding remarks regarding the application of these methods to species complexes and our efforts to document true biodiversity. This paper has online supplementary materials.  ( 3 min )
    It Takes Two to Negotiate: Modeling Social Exchange in Online Multiplayer Games. (arXiv:2311.08666v1 [cs.CL])
    Online games are dynamic environments where players interact with each other, which offers a rich setting for understanding how players negotiate their way through the game to an ultimate victory. This work studies online player interactions during the turn-based strategy game, Diplomacy. We annotated a dataset of over 10,000 chat messages for different negotiation strategies and empirically examined their importance in predicting long- and short-term game outcomes. Although negotiation strategies can be predicted reasonably accurately through the linguistic modeling of the chat messages, more is needed for predicting short-term outcomes such as trustworthiness. On the other hand, they are essential in graph-aware reinforcement learning approaches to predict long-term outcomes, such as a player's success, based on their prior negotiation history. We close with a discussion of the implications and impact of our work. The dataset is available at https://github.com/kj2013/claff-diplomacy.  ( 2 min )
    DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models. (arXiv:2311.08623v1 [cs.CV])
    Encoder-decoder transformer models have achieved great success on various vision-language (VL) tasks, but they suffer from high inference latency. Typically, the decoder takes up most of the latency because of the auto-regressive decoding. To accelerate the inference, we propose an approach of performing Dynamic Early Exit on Decoder (DEED). We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions. In addition, we leverage simple yet practical techniques, including shared generation head and adaptation modules, to keep accuracy when exiting at shallow decoder layers. Based on the multi-exit model, we perform step-level dynamic early exit during inference, where the model may decide to use fewer decoder layers based on its confidence of the current layer at each individual decoding step. Considering different number of decoder layers may be used at different decoding steps, we compute deeper-layer decoder features of previous decoding steps just-in-time, which ensures the features from different decoding steps are semantically aligned. We evaluate our approach with two state-of-the-art encoder-decoder transformer models on various VL tasks. We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.  ( 2 min )
    Clinical Characteristics and Laboratory Biomarkers in ICU-admitted Septic Patients with and without Bacteremia. (arXiv:2311.08433v1 [q-bio.QM])
    Few studies have investigated the diagnostic utilities of biomarkers for predicting bacteremia among septic patients admitted to intensive care units (ICU). Therefore, this study evaluated the prediction power of laboratory biomarkers to utilize those markers with high performance to optimize the predictive model for bacteremia. This retrospective cross-sectional study was conducted at the ICU department of Gyeongsang National University Changwon Hospital in 2019. Adult patients qualifying SEPSIS-3 (increase in sequential organ failure score greater than or equal to 2) criteria with at least two sets of blood culture were selected. Collected data was initially analyzed independently to identify the significant predictors, which was then used to build the multivariable logistic regression (MLR) model. A total of 218 patients with 48 cases of true bacteremia were analyzed in this research. Both CRP and PCT showed a substantial area under the curve (AUC) value for discriminating bacteremia among septic patients (0.757 and 0.845, respectively). To further enhance the predictive accuracy, we combined PCT, bilirubin, neutrophil lymphocyte ratio (NLR), platelets, lactic acid, erythrocyte sedimentation rate (ESR), and Glasgow Coma Scale (GCS) score to build the predictive model with an AUC of 0.907 (95% CI, 0.843 to 0.956). In addition, a high association between bacteremia and mortality rate was discovered through the survival analysis (0.004). While PCT is certainly a useful index for distinguishing patients with and without bacteremia by itself, our MLR model indicates that the accuracy of bacteremia prediction substantially improves by the combined use of PCT, bilirubin, NLR, platelets, lactic acid, ESR, and GCS score.  ( 3 min )
    Review of AlexNet for Medical Image Classification. (arXiv:2311.08655v1 [cs.CV])
    In recent years, the rapid development of deep learning has led to a wide range of applications in the field of medical image classification. The variants of neural network models with ever-increasing performance share some commonalities: to try to mitigate overfitting, improve generalization, avoid gradient vanishing and exploding, etc. AlexNet first utilizes the dropout technique to mitigate overfitting and the ReLU activation function to avoid gradient vanishing. Therefore, we focus our discussion on AlexNet, which has contributed greatly to the development of CNNs in 2012. After reviewing over 40 papers, including journal papers and conference papers, we give a narrative on the technical details, advantages, and application areas of AlexNet.  ( 2 min )
    Coreset Selection with Prioritized Multiple Objectives. (arXiv:2311.08675v1 [cs.LG])
    Coreset selection is powerful in reducing computational costs and accelerating data processing for deep learning algorithms. It strives to identify a small subset from large-scale data, so that training only on the subset practically performs on par with full data. When coreset selection is applied in realistic scenes, under the premise that the identified coreset has achieved comparable model performance, practitioners regularly desire the identified coreset can have a size as small as possible for lower costs and greater acceleration. Motivated by this desideratum, for the first time, we pose the problem of "coreset selection with prioritized multiple objectives", in which the smallest coreset size under model performance constraints is explored. Moreover, to address this problem, an innovative method is proposed, which maintains optimization priority order over the model performance and coreset size, and efficiently optimizes them in the coreset selection procedure. Theoretically, we provide the convergence guarantee of the proposed method. Empirically, extensive experiments confirm its superiority compared with previous strategies, often yielding better model performance with smaller coreset sizes.  ( 2 min )
    Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech. (arXiv:2311.08607v1 [cs.CL])
    Recognizing emotions in spoken communication is crucial for advanced human-machine interaction. Current emotion detection methodologies often display biases when applied cross-corpus. To address this, our study amalgamates 16 diverse datasets, resulting in 375 hours of data across languages like English, Chinese, and Japanese. We propose a soft labeling system to capture gradational emotional intensities. Using the Whisper encoder and data augmentation methods inspired by contrastive learning, our method emphasizes the temporal dynamics of emotions. Our validation on four multilingual datasets demonstrates notable zero-shot generalization. We publish our open source model weights and initial promising results after fine-tuning on Hume-Prosody.  ( 2 min )
    Towards Evaluating AI Systems for Moral Status Using Self-Reports. (arXiv:2311.08576v1 [cs.LG])
    As AI systems become more advanced and widely deployed, there will likely be increasing debate over whether AI systems could have conscious experiences, desires, or other states of potential moral significance. It is important to inform these discussions with empirical evidence to the extent possible. We argue that under the right circumstances, self-reports, or an AI system's statements about its own internal states, could provide an avenue for investigating whether AI systems have states of moral significance. Self-reports are the main way such states are assessed in humans ("Are you in pain?"), but self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say). To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models' self-reports, and using interpretability to corroborate self-reports. We also discuss challenges for our approach, from philosophical difficulties in interpreting self-reports to technical reasons why our proposal might fail. We hope our discussion inspires philosophers and AI researchers to criticize and improve our proposed methodology, as well as to run experiments to test whether self-reports can be made reliable enough to provide information about states of moral significance.  ( 3 min )
    Image complexity based fMRI-BOLD visual network categorization across visual datasets using topological descriptors and deep-hybrid learning. (arXiv:2311.08417v1 [eess.IV])
    This study proposes a new approach that investigates differences in topological characteristics of visual networks, which are constructed using fMRI BOLD time-series corresponding to visual datasets of COCO, ImageNet, and SUN. A publicly available BOLD5000 dataset is utilized that contains fMRI scans while viewing 5254 images of diverse complexities. The objective of this study is to examine how network topology differs in response to distinct visual stimuli from these visual datasets. To achieve this, 0- and 1-dimensional persistence diagrams are computed for each visual network representing COCO, ImageNet, and SUN. For extracting suitable features from topological persistence diagrams, K-means clustering is executed. The extracted K-means cluster features are fed to a novel deep-hybrid model that yields accuracy in the range of 90%-95% in classifying these visual networks. To understand vision, this type of visual network categorization across visual datasets is important as it captures differences in BOLD signals while perceiving images with different contexts and complexities. Furthermore, distinctive topological patterns of visual network associated with each dataset, as revealed from this study, could potentially lead to the development of future neuroimaging biomarkers for diagnosing visual processing disorders like visual agnosia or prosopagnosia, and tracking changes in visual cognition over time.  ( 3 min )
    Understanding Calibration for Multilingual Question Answering Models. (arXiv:2311.08669v1 [cs.CL])
    Multilingual pre-trained language models are incredibly effective at Question Answering (QA), a core task in Natural Language Understanding, achieving high accuracies on several multilingual benchmarks. However, little is known about how well they are calibrated. In this paper, we study the calibration properties of several pre-trained multilingual large language models (LLMs) on a variety of question-answering tasks. We perform extensive experiments, spanning both extractive and generative QA model designs and diverse languages, spanning both high-resource and low-resource ones. We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings, and investigate strategies to improve it, including post-hoc methods and regularized fine-tuning. We demonstrate automatically translated data augmentation as a highly effective technique to improve model calibration. We also conduct a number of ablation experiments to study the effect of model size on calibration and how multilingual models compare with their monolingual counterparts for diverse tasks and languages.  ( 2 min )
    Non-Contact Breathing Rate Detection Using Optical Flow. (arXiv:2311.08426v1 [eess.IV])
    Breathing rate is a vital health metric that is an invaluable indicator of the overall health of a person. In recent years, the non-contact measurement of health signals such as breathing rate has been a huge area of development, with a wide range of applications from telemedicine to driver monitoring systems. This paper presents an investigation into a method of non-contact breathing rate detection using a motion detection algorithm, optical flow. Optical flow is used to successfully measure breathing rate by tracking the motion of specific points on the body. In this study, the success of optical flow when using different sets of points is evaluated. Testing shows that both chest and facial movement can be used to determine breathing rate but to different degrees of success. The chest generates very accurate signals, with an RMSE of 0.63 on the tested videos. Facial points can also generate reliable signals when there is minimal head movement but are much more vulnerable to noise caused by head/body movements. These findings highlight the potential of optical flow as a non-invasive method for breathing rate detection and emphasize the importance of selecting appropriate points to optimize accuracy.  ( 2 min )
    Natural Language Processing for Financial Regulation. (arXiv:2311.08533v1 [cs.CL])
    This article provides an understanding of Natural Language Processing techniques in the framework of financial regulation, more specifically in order to perform semantic matching search between rules and policy when no dataset is available for supervised learning. We outline how to outperform simple pre-trained sentences-transformer models using freely available resources and explain the mathematical concepts behind the key building blocks of Natural Language Processing.  ( 2 min )
    Mean-field variational inference with the TAP free energy: Geometric and statistical properties in linear models. (arXiv:2311.08442v1 [math.ST])
    We study mean-field variational inference in a Bayesian linear model when the sample size n is comparable to the dimension p. In high dimensions, the common approach of minimizing a Kullback-Leibler divergence from the posterior distribution, or maximizing an evidence lower bound, may deviate from the true posterior mean and underestimate posterior uncertainty. We study instead minimization of the TAP free energy, showing in a high-dimensional asymptotic framework that it has a local minimizer which provides a consistent estimate of the posterior marginals and may be used for correctly calibrated posterior inference. Geometrically, we show that the landscape of the TAP free energy is strongly convex in an extensive neighborhood of this local minimizer, which under certain general conditions can be found by an Approximate Message Passing (AMP) algorithm. We then exhibit an efficient algorithm that linearly converges to the minimizer within this local neighborhood. In settings where it is conjectured that no efficient algorithm can find this local neighborhood, we prove analogous geometric properties for a local minimizer of the TAP free energy reachable by AMP, and show that posterior inference based on this minimizer remains correctly calibrated.  ( 2 min )
    Adversarial Imitation Learning On Aggregated Data. (arXiv:2311.08568v1 [cs.LG])
    Inverse Reinforcement Learning (IRL) learns an optimal policy, given some expert demonstrations, thus avoiding the need for the tedious process of specifying a suitable reward function. However, current methods are constrained by at least one of the following requirements. The first one is the need to fully solve a forward Reinforcement Learning (RL) problem in the inner loop of the algorithm, which might be prohibitively expensive in many complex environments. The second one is the need for full trajectories from the experts, which might not be easily available. The third one is the assumption that the expert data is homogeneous rather than a collection from various experts or possibly alternative solutions to the same task. Such constraints make IRL approaches either not scalable or not usable on certain existing systems. In this work we propose an approach which removes these requirements through a dynamic, adaptive method called Adversarial Imitation Learning on Aggregated Data (AILAD). It learns conjointly both a non linear reward function and the associated optimal policy using an adversarial framework. The reward learner only uses aggregated data. Moreover, it generates diverse behaviors producing a distribution over the aggregated data matching that of the experts.  ( 2 min )
    Spatio-Temporal Graph Neural Point Process for Traffic Congestion Event Prediction. (arXiv:2311.08635v1 [cs.LG])
    Traffic congestion event prediction is an important yet challenging task in intelligent transportation systems. Many existing works about traffic prediction integrate various temporal encoders and graph convolution networks (GCNs), called spatio-temporal graph-based neural networks, which focus on predicting dense variables such as flow, speed and demand in time snapshots, but they can hardly forecast the traffic congestion events that are sparsely distributed on the continuous time axis. In recent years, neural point process (NPP) has emerged as an appropriate framework for event prediction in continuous time scenarios. However, most conventional works about NPP cannot model the complex spatio-temporal dependencies and congestion evolution patterns. To address these limitations, we propose a spatio-temporal graph neural point process framework, named STGNPP for traffic congestion event prediction. Specifically, we first design the spatio-temporal graph learning module to fully capture the long-range spatio-temporal dependencies from the historical traffic state data along with the road network. The extracted spatio-temporal hidden representation and congestion event information are then fed into a continuous gated recurrent unit to model the congestion evolution patterns. In particular, to fully exploit the periodic information, we also improve the intensity function calculation of the point process with a periodic gated mechanism. Finally, our model simultaneously predicts the occurrence time and duration of the next congestion. Extensive experiments on two real-world datasets demonstrate that our method achieves superior performance in comparison to existing state-of-the-art approaches.  ( 3 min )
    Uncertainty Quantification in Neural-Network Based Pain Intensity Estimation. (arXiv:2311.08569v1 [cs.LG])
    Improper pain management can lead to severe physical or mental consequences, including suffering, and an increased risk of opioid dependency. Assessing the presence and severity of pain is imperative to prevent such outcomes and determine the appropriate intervention. However, the evaluation of pain intensity is challenging because different individuals experience pain differently. To overcome this, researchers have employed machine learning models to evaluate pain intensity objectively. However, these efforts have primarily focused on point estimation of pain, disregarding the inherent uncertainty and variability present in the data and model. Consequently, the point estimates provide only partial information for clinical decision-making. This study presents a neural network-based method for objective pain interval estimation, incorporating uncertainty quantification. This work explores three algorithms: the bootstrap method, lower and upper bound estimation (LossL) optimized by genetic algorithm, and modified lower and upper bound estimation (LossS) optimized by gradient descent algorithm. Our empirical results reveal that LossS outperforms the other two by providing a narrower prediction interval. As LossS outperforms, we assessed its performance in three different scenarios for pain assessment: (1) a generalized approach (single model for the entire population), (2) a personalized approach (separate model for each individual), and (3) a hybrid approach (separate model for each cluster of individuals). Our findings demonstrate the hybrid approach's superior performance, with notable practicality in clinical contexts. It has the potential to be a valuable tool for clinicians, enabling objective pain intensity assessment while taking uncertainty into account. This capability is crucial in facilitating effective pain management and reducing the risks associated with improper treatment.  ( 3 min )
    Deep Phenotyping of Non-Alcoholic Fatty Liver Disease Patients with Genetic Factors for Insights into the Complex Disease. (arXiv:2311.08428v1 [q-bio.QM])
    Non-alcoholic fatty liver disease (NAFLD) is a prevalent chronic liver disorder characterized by the excessive accumulation of fat in the liver in individuals who do not consume significant amounts of alcohol, including risk factors like obesity, insulin resistance, type 2 diabetes, etc. We aim to identify subgroups of NAFLD patients based on demographic, clinical, and genetic characteristics for precision medicine. The genomic and phenotypic data (3,408 cases and 4,739 controls) for this study were gathered from participants in Mayo Clinic Tapestry Study (IRB#19-000001) and their electric health records, including their demographic, clinical, and comorbidity data, and the genotype information through whole exome sequencing performed at Helix using the Exome+$^\circledR$ Assay according to standard procedure (www$.$helix$.$com). Factors highly relevant to NAFLD were determined by the chi-square test and stepwise backward-forward regression model. Latent class analysis (LCA) was performed on NAFLD cases using significant indicator variables to identify subgroups. The optimal clustering revealed 5 latent subgroups from 2,013 NAFLD patients (mean age 60.6 years and 62.1% women), while a polygenic risk score based on 6 single-nucleotide polymorphism (SNP) variants and disease outcomes were used to analyze the subgroups. The groups are characterized by metabolic syndrome, obesity, different comorbidities, psychoneurological factors, and genetic factors. Odds ratios were utilized to compare the risk of complex diseases, such as fibrosis, cirrhosis, and hepatocellular carcinoma (HCC), as well as liver failure between the clusters. Cluster 2 has a significantly higher complex disease outcome compared to other clusters. Keywords: Fatty liver disease; Polygenic risk score; Precision medicine; Deep phenotyping; NAFLD comorbidities; Latent class analysis.  ( 3 min )
    On semi-supervised estimation using exponential tilt mixture models. (arXiv:2311.08504v1 [stat.ML])
    Consider a semi-supervised setting with a labeled dataset of binary responses and predictors and an unlabeled dataset with only the predictors. Logistic regression is equivalent to an exponential tilt model in the labeled population. For semi-supervised estimation, we develop further analysis and understanding of a statistical approach using exponential tilt mixture (ETM) models and maximum nonparametric likelihood estimation, while allowing that the class proportions may differ between the unlabeled and labeled data. We derive asymptotic properties of ETM-based estimation and demonstrate improved efficiency over supervised logistic regression in a random sampling setup and an outcome-stratified sampling setup previously used. Moreover, we reconcile such efficiency improvement with the existing semiparametric efficiency theory when the class proportions in the unlabeled and labeled data are restricted to be the same. We also provide a simulation study to numerically illustrate our theoretical findings.  ( 2 min )
    LLMs cannot find reasoning errors, but can correct them!. (arXiv:2311.08516v1 [cs.AI])
    While self-correction has shown promise in improving LLM outputs in terms of style and quality (e.g. Chen et al., 2023; Madaan et al., 2023), recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023). In this paper, we break down the self-correction process into two core components: mistake finding and output correction. For mistake finding, we release BIG-Bench Mistake, a dataset of logical mistakes in Chain-of-Thought reasoning traces. We provide benchmark numbers for several state-of-the-art LLMs, and demonstrate that LLMs generally struggle with finding logical mistakes. For output correction, we propose a backtracking method which provides large improvements when given information on mistake location. We construe backtracking as a lightweight alternative to reinforcement learning methods, and show that it remains effective with a reward model at 60-70% accuracy.  ( 2 min )
    Uplift Modeling based on Graph Neural Network Combined with Causal Knowledge. (arXiv:2311.08434v1 [cs.LG])
    Uplift modeling is a fundamental component of marketing effect modeling, which is commonly employed to evaluate the effects of treatments on outcomes. Through uplift modeling, we can identify the treatment with the greatest benefit. On the other side, we can identify clients who are likely to make favorable decisions in response to a certain treatment. In the past, uplift modeling approaches relied heavily on the difference-in-difference (DID) architecture, paired with a machine learning model as the estimation learner, while neglecting the link and confidential information between features. We proposed a framework based on graph neural networks that combine causal knowledge with an estimate of uplift value. Firstly, we presented a causal representation technique based on CATE (conditional average treatment effect) estimation and adjacency matrix structure learning. Secondly, we suggested a more scalable uplift modeling framework based on graph convolution networks for combining causal knowledge. Our findings demonstrate that this method works effectively for predicting uplift values, with small errors in typical simulated data, and its effectiveness has been verified in actual industry marketing data.  ( 2 min )
    Surrogate Neural Networks to Estimate Parametric Sensitivity of Ocean Models. (arXiv:2311.08421v1 [physics.ao-ph])
    Modeling is crucial to understanding the effect of greenhouse gases, warming, and ice sheet melting on the ocean. At the same time, ocean processes affect phenomena such as hurricanes and droughts. Parameters in the models that cannot be physically measured have a significant effect on the model output. For an idealized ocean model, we generated perturbed parameter ensemble data and trained surrogate neural network models. The neural surrogates accurately predicted the one-step forward dynamics, of which we then computed the parametric sensitivity.  ( 2 min )
  • Open

    Learning Fair Division from Bandit Feedback. (arXiv:2311.09068v1 [cs.LG])
    This work addresses learning online fair division under uncertainty, where a central planner sequentially allocates items without precise knowledge of agents' values or utilities. Departing from conventional online algorithm, the planner here relies on noisy, estimated values obtained after allocating items. We introduce wrapper algorithms utilizing \textit{dual averaging}, enabling gradual learning of both the type distribution of arriving items and agents' values through bandit feedback. This approach enables the algorithms to asymptotically achieve optimal Nash social welfare in linear Fisher markets with agents having additive utilities. We establish regret bounds in Nash social welfare and empirically validate the superior performance of our proposed algorithms across synthetic and empirical datasets.
    Supervised and Penalized Baseline Correction. (arXiv:2310.18306v2 [stat.ML] UPDATED)
    Spectroscopic measurements can show distorted spectral shapes arising from a mixture of absorbing and scattering contributions. These distortions (or baselines) often manifest themselves as non-constant offsets or low-frequency oscillations. As a result, these baselines can adversely affect analytical and quantitative results. Baseline correction is an umbrella term where one applies pre-processing methods to obtain baseline spectra (the unwanted distortions) and then remove the distortions by differencing. However, current state-of-the art baseline correction methods do not utilize analyte concentrations even if they are available, or even if they contribute significantly to the observed spectral variability. We examine a class of state-of-the-art methods (penalized baseline correction) and modify them such that they can accommodate a priori analyte concentrations such that prediction can be enhanced. Performance will be assessed on two near infra-red data sets across both classical penalized baseline correction methods (without analyte information) and modified penalized baseline correction methods (leveraging analyte information).
    Graph-based Neural Weather Prediction for Limited Area Modeling. (arXiv:2309.17370v2 [cs.LG] UPDATED)
    The rise of accurate machine learning methods for weather forecasting is creating radical new possibilities for modeling the atmosphere. In the time of climate change, having access to high-resolution forecasts from models like these is also becoming increasingly vital. While most existing Neural Weather Prediction (NeurWP) methods focus on global forecasting, an important question is how these techniques can be applied to limited area modeling. In this work we adapt the graph-based NeurWP approach to the limited area setting and propose a multi-scale hierarchical model extension. Our approach is validated by experiments with a local model for the Nordic region.
    Error Bounds for Learning with Vector-Valued Random Features. (arXiv:2305.17170v2 [stat.ML] UPDATED)
    This paper provides a comprehensive error analysis of learning with vector-valued random features (RF). The theory is developed for RF ridge regression in a fully general infinite-dimensional input-output setting, but nonetheless applies to and improves existing finite-dimensional analyses. In contrast to comparable work in the literature, the approach proposed here relies on a direct analysis of the underlying risk functional and completely avoids the explicit RF ridge regression solution formula in terms of random matrices. This removes the need for concentration results in random matrix theory or their generalizations to random operators. The main results established in this paper include strong consistency of vector-valued RF estimators under model misspecification and minimax optimal convergence rates in the well-specified setting. The parameter complexity (number of random features) and sample complexity (number of labeled data) required to achieve such rates are comparable with Monte Carlo intuition and free from logarithmic factors.
    Supervised low-rank semi-nonnegative matrix factorization with frequency regularization for forecasting spatio-temporal data. (arXiv:2311.08636v1 [stat.ML])
    We propose a novel methodology for forecasting spatio-temporal data using supervised semi-nonnegative matrix factorization (SSNMF) with frequency regularization. Matrix factorization is employed to decompose spatio-temporal data into spatial and temporal components. To improve clarity in the temporal patterns, we introduce a nonnegativity constraint on the time domain along with regularization in the frequency domain. Specifically, regularization in the frequency domain involves selecting features in the frequency space, making an interpretation in the frequency domain more convenient. We propose two methods in the frequency domain: soft and hard regularizations, and provide convergence guarantees to first-order stationary points of the corresponding constrained optimization problem. While our primary motivation stems from geophysical data analysis based on GRACE (Gravity Recovery and Climate Experiment) data, our methodology has the potential for wider application. Consequently, when applying our methodology to GRACE data, we find that the results with the proposed methodology are comparable to previous research in the field of geophysical sciences but offer clearer interpretability.
    Taming under isoperimetry. (arXiv:2311.09003v1 [math.PR])
    In this article we propose a novel taming Langevin-based scheme called $\mathbf{sTULA}$ to sample from distributions with superlinearly growing log-gradient which also satisfy a Log-Sobolev inequality. We derive non-asymptotic convergence bounds in $KL$ and consequently total variation and Wasserstein-$2$ distance from the target measure. Non-asymptotic convergence guarantees are provided for the performance of the new algorithm as an optimizer. Finally, some theoretical results on isoperimertic inequalities for distributions with superlinearly growing gradients are provided. Key findings are a Log-Sobolev inequality with constant independent of the dimension, in the presence of a higher order regularization and a Poincare inequality with constant independent of temperature and dimension under a novel non-convex theoretical framework.
    On the Foundation of Distributionally Robust Reinforcement Learning. (arXiv:2311.09018v1 [cs.LG])
    Motivated by the need for a robust policy in the face of environment shifts between training and the deployment, we contribute to the theoretical foundation of distributionally robust reinforcement learning (DRRL). This is accomplished through a comprehensive modeling framework centered around distributionally robust Markov decision processes (DRMDPs). This framework obliges the decision maker to choose an optimal policy under the worst-case distributional shift orchestrated by an adversary. By unifying and extending existing formulations, we rigorously construct DRMDPs that embraces various modeling attributes for both the decision maker and the adversary. These attributes include adaptability granularity, exploring history-dependent, Markov, and Markov time-homogeneous decision maker and adversary dynamics. Additionally, we delve into the flexibility of shifts induced by the adversary, examining SA and S-rectangularity. Within this DRMDP framework, we investigate conditions for the existence or absence of the dynamic programming principle (DPP). From an algorithmic standpoint, the existence of DPP holds significant implications, as the vast majority of existing data and computationally efficiency RL algorithms are reliant on the DPP. To study its existence, we comprehensively examine combinations of controller and adversary attributes, providing streamlined proofs grounded in a unified methodology. We also offer counterexamples for settings in which a DPP with full generality is absent.
    Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit. (arXiv:2308.10238v3 [cs.LG] UPDATED)
    We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic arms, and the reward of each arm $s\in\{1, \ldots, d\}$ follows an unknown distribution with mean $\mu_s$. In each time step, a player pulls a single arm and observes its reward. The player's goal is to identify the optimal \emph{action} $\boldsymbol{\pi}^{*} = \argmax_{\boldsymbol{\pi} \in \mathcal{A}} \boldsymbol{\mu}^{\top}\boldsymbol{\pi}$ from a finite-sized real-valued \emph{action set} $\mathcal{A}\subset \mathbb{R}^{d}$ with as few arm pulls as possible. Previous methods in the R-CPE-MAB assume that the size of the action set $\mathcal{A}$ is polynomial in $d$. We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large in $d$. We also introduce a novel problem-dependent sample complexity lower bound of the R-CPE-MAB problem, and show that the GenTS-Explore algorithm achieves the optimal sample complexity up to a problem-dependent constant factor.
    Manifold learning in Wasserstein space. (arXiv:2311.08549v1 [stat.ML])
    This paper aims at building the theoretical foundations for manifold learning algorithms in the space of absolutely continuous probability measures on a compact and convex subset of $\mathbb{R}^d$, metrized with the Wasserstein-2 distance $W$. We begin by introducing a natural construction of submanifolds $\Lambda$ of probability measures equipped with metric $W_\Lambda$, the geodesic restriction of $W$ to $\Lambda$. In contrast to other constructions, these submanifolds are not necessarily flat, but still allow for local linearizations in a similar fashion to Riemannian submanifolds of $\mathbb{R}^d$. We then show how the latent manifold structure of $(\Lambda,W_{\Lambda})$ can be learned from samples $\{\lambda_i\}_{i=1}^N$ of $\Lambda$ and pairwise extrinsic Wasserstein distances $W$ only. In particular, we show that the metric space $(\Lambda,W_{\Lambda})$ can be asymptotically recovered in the sense of Gromov--Wasserstein from a graph with nodes $\{\lambda_i\}_{i=1}^N$ and edge weights $W(\lambda_i,\lambda_j)$. In addition, we demonstrate how the tangent space at a sample $\lambda$ can be asymptotically recovered via spectral analysis of a suitable "covariance operator" using optimal transport maps from $\lambda$ to sufficiently close and diverse samples $\{\lambda_i\}_{i=1}^N$. The paper closes with some explicit constructions of submanifolds $\Lambda$ and numerical examples on the recovery of tangent spaces through spectral analysis.
    German FinBERT: A German Pre-trained Language Model. (arXiv:2311.08793v1 [cs.CL])
    This study presents German FinBERT, a novel pre-trained German language model tailored for financial textual data. The model is trained through a comprehensive pre-training process, leveraging a substantial corpus comprising financial reports, ad-hoc announcements and news related to German companies. The corpus size is comparable to the data sets commonly used for training standard BERT models. I evaluate the performance of German FinBERT on downstream tasks, specifically sentiment prediction, topic recognition and question answering against generic German language models. My results demonstrate improved performance on finance-specific data, indicating the efficacy of German FinBERT in capturing domain-specific nuances. The presented findings suggest that German FinBERT holds promise as a valuable tool for financial text analysis, potentially benefiting various applications in the financial domain.
    Measuring association with recursive rank binning. (arXiv:2311.08561v1 [stat.ME])
    Pairwise measures of dependence are a common tool to map data in the early stages of analysis with several modern examples based on maximized partitions of the pairwise sample space. Following a short survey of modern measures of dependence, we introduce a new measure which recursively splits the ranks of a pair of variables to partition the sample space and computes the $\chi^2$ statistic on the resulting bins. Splitting logic is detailed for splits maximizing a score function and randomly selected splits. Simulations indicate that random splitting produces a statistic conservatively approximated by the $\chi^2$ distribution without a loss of power to detect numerous different data patterns compared to maximized binning. Though it seems to add no power to detect dependence, maximized recursive binning is shown to produce a natural visualization of the data and the measure. Applying maximized recursive rank binning to S&P 500 constituent data suggests the automatic detection of tail dependence.
    Future-Dependent Value-Based Off-Policy Evaluation in POMDPs. (arXiv:2207.13081v2 [cs.LG] UPDATED)
    We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play similar roles as classical value functions in fully-observable MDPs. We derive a new Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We obtain the PAC result, which implies our OPE estimator is consistent as long as futures and histories contain sufficient information about latent states, and the Bellman completeness. Finally, we extend our methods to learning of dynamics and establish the connection between our approach and the well-known spectral learning methods in POMDPs.  ( 2 min )
    Variational Temporal IRT: Fast, Accurate, and Explainable Inference of Dynamic Learner Proficiency. (arXiv:2311.08594v1 [cs.LG])
    Dynamic Item Response Models extend the standard Item Response Theory (IRT) to capture temporal dynamics in learner ability. While these models have the potential to allow instructional systems to actively monitor the evolution of learner proficiency in real time, existing dynamic item response models rely on expensive inference algorithms that scale poorly to massive datasets. In this work, we propose Variational Temporal IRT (VTIRT) for fast and accurate inference of dynamic learner proficiency. VTIRT offers orders of magnitude speedup in inference runtime while still providing accurate inference. Moreover, the proposed algorithm is intrinsically interpretable by virtue of its modular design. When applied to 9 real student datasets, VTIRT consistently yields improvements in predicting future learner performance over other learner proficiency models.  ( 2 min )
    Semidefinite programs simulate approximate message passing robustly. (arXiv:2311.09017v1 [cs.DS])
    Approximate message passing (AMP) is a family of iterative algorithms that generalize matrix power iteration. AMP algorithms are known to optimally solve many average-case optimization problems. In this paper, we show that a large class of AMP algorithms can be simulated in polynomial time by \emph{local statistics hierarchy} semidefinite programs (SDPs), even when an unknown principal minor of measure $1/\mathrm{polylog}(\mathrm{dimension})$ is adversarially corrupted. Ours are the first robust guarantees for many of these problems. Further, our results offer an interesting counterpoint to strong lower bounds against less constrained SDP relaxations for average-case max-cut-gain (a.k.a. "optimizing the Sherrington-Kirkpatrick Hamiltonian") and other problems.  ( 2 min )
    Federated Learning for Sparse Principal Component Analysis. (arXiv:2311.08677v1 [cs.LG])
    In the rapidly evolving realm of machine learning, algorithm effectiveness often faces limitations due to data quality and availability. Traditional approaches grapple with data sharing due to legal and privacy concerns. The federated learning framework addresses this challenge. Federated learning is a decentralized approach where model training occurs on client sides, preserving privacy by keeping data localized. Instead of sending raw data to a central server, only model updates are exchanged, enhancing data security. We apply this framework to Sparse Principal Component Analysis (SPCA) in this work. SPCA aims to attain sparse component loadings while maximizing data variance for improved interpretability. Beside the L1 norm regularization term in conventional SPCA, we add a smoothing function to facilitate gradient-based optimization methods. Moreover, in order to improve computational efficiency, we introduce a least squares approximation to original SPCA. This enables analytic solutions on the optimization processes, leading to substantial computational improvements. Within the federated framework, we formulate SPCA as a consensus optimization problem, which can be solved using the Alternating Direction Method of Multipliers (ADMM). Our extensive experiments involve both IID and non-IID random features across various data owners. Results on synthetic and public datasets affirm the efficacy of our federated SPCA approach.  ( 2 min )
    Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations. (arXiv:2311.08815v1 [cs.LG])
    Self-supervised representation learning often uses data augmentations to induce some invariance to "style" attributes of the data. However, with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded. To address this, we introduce a more principled approach that seeks to disentangle style features rather than discard them. The key idea is to add multiple style embedding spaces where: (i) each is invariant to all-but-one augmentation; and (ii) joint entropy is maximized. We formalize our structured data-augmentation procedure from a causal latent-variable-model perspective, and prove identifiability of both content and (multiple blocks of) style variables. We empirically demonstrate the benefits of our approach on synthetic datasets and then present promising but limited results on ImageNet.  ( 2 min )
    Mean-field variational inference with the TAP free energy: Geometric and statistical properties in linear models. (arXiv:2311.08442v1 [math.ST])
    We study mean-field variational inference in a Bayesian linear model when the sample size n is comparable to the dimension p. In high dimensions, the common approach of minimizing a Kullback-Leibler divergence from the posterior distribution, or maximizing an evidence lower bound, may deviate from the true posterior mean and underestimate posterior uncertainty. We study instead minimization of the TAP free energy, showing in a high-dimensional asymptotic framework that it has a local minimizer which provides a consistent estimate of the posterior marginals and may be used for correctly calibrated posterior inference. Geometrically, we show that the landscape of the TAP free energy is strongly convex in an extensive neighborhood of this local minimizer, which under certain general conditions can be found by an Approximate Message Passing (AMP) algorithm. We then exhibit an efficient algorithm that linearly converges to the minimizer within this local neighborhood. In settings where it is conjectured that no efficient algorithm can find this local neighborhood, we prove analogous geometric properties for a local minimizer of the TAP free energy reachable by AMP, and show that posterior inference based on this minimizer remains correctly calibrated.  ( 2 min )
    A Unified Approach to Learning Ising Models: Beyond Independence and Bounded Width. (arXiv:2311.09197v1 [cs.LG])
    We revisit the problem of efficiently learning the underlying parameters of Ising models from data. Current algorithmic approaches achieve essentially optimal sample complexity when given i.i.d. samples from the stationary measure and the underlying model satisfies "width" bounds on the total $\ell_1$ interaction involving each node. We show that a simple existing approach based on node-wise logistic regression provably succeeds at recovering the underlying model in several new settings where these assumptions are violated: (1) Given dynamically generated data from a wide variety of local Markov chains, like block or round-robin dynamics, logistic regression recovers the parameters with optimal sample complexity up to $\log\log n$ factors. This generalizes the specialized algorithm of Bresler, Gamarnik, and Shah [IEEE Trans. Inf. Theory'18] for structure recovery in bounded degree graphs from Glauber dynamics. (2) For the Sherrington-Kirkpatrick model of spin glasses, given $\mathsf{poly}(n)$ independent samples, logistic regression recovers the parameters in most of the known high-temperature regime via a simple reduction to weaker structural properties of the measure. This improves on recent work of Anari, Jain, Koehler, Pham, and Vuong [ArXiv'23] which gives distribution learning at higher temperature. (3) As a simple byproduct of our techniques, logistic regression achieves an exponential improvement in learning from samples in the M-regime of data considered by Dutt, Lokhov, Vuffray, and Misra [ICML'21] as well as novel guarantees for learning from the adversarial Glauber dynamics of Chin, Moitra, Mossel, and Sandon [ArXiv'23]. Our approach thus significantly generalizes the elegant analysis of Wu, Sanghavi, and Dimakis [Neurips'19] without any algorithmic modification.  ( 3 min )
    ExpM+NF: Differentially Private Machine Learning that Surpasses DPSGD. (arXiv:2311.09200v1 [stat.ML])
    In this pioneering work we formulate ExpM+NF, a method for training machine learning (ML) on private data with pre-specified differentially privacy guarantee $\varepsilon>0, \delta=0$, by using the Exponential Mechanism (ExpM) and an auxiliary Normalizing Flow (NF). We articulate theoretical benefits of ExpM+NF over Differentially Private Stochastic Gradient Descent (DPSGD), the state-of-the-art (SOTA) and de facto method for differentially private ML, and we empirically test ExpM+NF against DPSGD using the SOTA implementation (Opacus with PRV accounting) in multiple classification tasks on the Adult Dataset (census data) and MIMIC-III Dataset (electronic healthcare records) using Logistic Regression and GRU-D, a deep learning recurrent neural network with ~20K-100K parameters. In all experiments, ExpM+NF achieves greater than 93% of the non-private training accuracy (AUC) for $\varepsilon \in [1\mathrm{e}{-3}, 1]$, exhibiting greater accuracy (higher AUC) and privacy (lower $\varepsilon$ with $\delta=0$) than DPSGD. Differentially private ML generally considers $\varepsilon \in [1,10]$ to maintain reasonable accuracy; hence, ExpM+NF's ability to provide strong accuracy for orders of magnitude better privacy (smaller $\varepsilon$) substantially pushes what is currently possible in differentially private ML. Training time results are presented showing ExpM+NF is comparable to (slightly faster) than DPSGD. Code for these experiments will be provided after review. Limitations and future directions are provided.  ( 2 min )
    The Cellwise Minimum Covariance Determinant Estimator. (arXiv:2207.13493v2 [stat.ME] UPDATED)
    The usual Minimum Covariance Determinant (MCD) estimator of a covariance matrix is robust against casewise outliers. These are cases (that is, rows of the data matrix) that behave differently from the majority of cases, raising suspicion that they might belong to a different population. On the other hand, cellwise outliers are individual cells in the data matrix. When a row contains one or more outlying cells, the other cells in the same row still contain useful information that we wish to preserve. We propose a cellwise robust version of the MCD method, called cellMCD. Its main building blocks are observed likelihood and a penalty term on the number of flagged cellwise outliers. It possesses good breakdown properties. We construct a fast algorithm for cellMCD based on concentration steps (C-steps) that always lower the objective. The method performs well in simulations with cellwise outliers, and has high finite-sample efficiency on clean data. It is illustrated on real data with visualizations of the results.  ( 2 min )
    Estimating Appearance Models for Image Segmentation via Tensor Factorization. (arXiv:2208.07853v2 [cs.CV] UPDATED)
    Image Segmentation is one of the core tasks in Computer Vision and solving it often depends on modeling the image appearance data via the color distributions of each it its constituent regions. Whereas many segmentation algorithms handle the appearance models dependence using alternation or implicit methods, we propose here a new approach to directly estimate them from the image without prior information on the underlying segmentation. Our method uses local high order color statistics from the image as an input to tensor factorization-based estimator for latent variable models. This approach is able to estimate models in multiregion images and automatically output the regions proportions without prior user interaction, overcoming the drawbacks from a prior attempt to this problem. We also demonstrate the performance of our proposed method in many challenging synthetic and real imaging scenarios and show that it leads to an efficient segmentation algorithm.  ( 2 min )
    Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev and Besov Spaces. (arXiv:2211.14400v4 [stat.ML] UPDATED)
    Let $\Omega = [0,1]^d$ be the unit cube in $\mathbb{R}^d$. We study the problem of how efficiently, in terms of the number of parameters, deep neural networks with the ReLU activation function can approximate functions in the Sobolev spaces $W^s(L_q(\Omega))$ and Besov spaces $B^s_r(L_q(\Omega))$, with error measured in the $L_p(\Omega)$ norm. This problem is important when studying the application of neural networks in a variety of fields, including scientific computing and signal processing, and has previously been solved only when $p=q=\infty$. Our contribution is to provide a complete solution for all $1\leq p,q\leq \infty$ and $s > 0$ for which the corresponding Sobolev or Besov space compactly embeds into $L_p$. The key technical tool is a novel bit-extraction technique which gives an optimal encoding of sparse vectors. This enables us to obtain sharp upper bounds in the non-linear regime where $p > q$. We also provide a novel method for deriving $L_p$-approximation lower bounds based upon VC-dimension when $p < \infty$. Our results show that very deep ReLU networks significantly outperform classical methods of approximation in terms of the number of parameters, but that this comes at the cost of parameters which are not encodable.  ( 2 min )
    Arbitrariness and Prediction: The Confounding Role of Variance in Fair Classification. (arXiv:2301.11562v6 [cs.LG] UPDATED)
    Variance in predictions across different trained models is a significant, under-explored source of error in fair binary classification. In practice, the variance on some data examples is so large that decisions can be effectively arbitrary. To investigate this problem, we take an experimental approach and make four overarching contributions: We: 1) Define a metric called self-consistency, derived from variance, which we use as a proxy for measuring and reducing arbitrariness; 2) Develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary; 3) Conduct the largest to-date empirical study of the role of variance (vis-a-vis self-consistency and arbitrariness) in fair binary classification; and, 4) Release a toolkit that makes the US Home Mortgage Disclosure Act (HMDA) datasets easily usable for future research. Altogether, our experiments reveal shocking insights about the reliability of conclusions on benchmark datasets. Most fair binary classification benchmarks are close-to-fair when taking into account the amount of arbitrariness present in predictions -- before we even try to apply any fairness interventions. This finding calls into question the practical utility of common algorithmic fairness methods, and in turn suggests that we should reconsider how we choose to measure fairness in binary classification.  ( 3 min )
    Deep Neural Network Identification of Limnonectes Species and New Class Detection Using Image Data. (arXiv:2311.08661v1 [stat.ML])
    As is true of many complex tasks, the work of discovering, describing, and understanding the diversity of life on Earth (viz., biological systematics and taxonomy) requires many tools. Some of this work can be accomplished as it has been done in the past, but some aspects present us with challenges which traditional knowledge and tools cannot adequately resolve. One such challenge is presented by species complexes in which the morphological similarities among the group members make it difficult to reliably identify known species and detect new ones. We address this challenge by developing new tools using the principles of machine learning to resolve two specific questions related to species complexes. The first question is formulated as a classification problem in statistics and machine learning and the second question is an out-of-distribution (OOD) detection problem. We apply these tools to a species complex comprising Southeast Asian stream frogs (Limnonectes kuhlii complex) and employ a morphological character (hind limb skin texture) traditionally treated qualitatively in a quantitative and objective manner. We demonstrate that deep neural networks can successfully automate the classification of an image into a known species group for which it has been trained. We further demonstrate that the algorithm can successfully classify an image into a new class if the image does not belong to the existing classes. Additionally, we use the larger MNIST dataset to test the performance of our OOD detection algorithm. We finish our paper with some concluding remarks regarding the application of these methods to species complexes and our efforts to document true biodiversity. This paper has online supplementary materials.  ( 3 min )
    Statistical learning by sparse deep neural networks. (arXiv:2311.08845v1 [math.ST])
    We consider a deep neural network estimator based on empirical risk minimization with l_1-regularization. We derive a general bound for its excess risk in regression and classification (including multiclass), and prove that it is adaptively nearly-minimax (up to log-factors) simultaneously across the entire range of various function classes.  ( 2 min )
    Discrete-time Competing-Risks Regression with or without Penalization. (arXiv:2303.01186v2 [stat.ME] UPDATED)
    Many studies employ the analysis of time-to-event data that incorporates competing risks and right censoring. Most methods and software packages are geared towards analyzing data that comes from a continuous failure time distribution. However, failure-time data may sometimes be discrete either because time is inherently discrete or due to imprecise measurement. This paper introduces a novel estimation procedure for discrete-time survival analysis with competing events. The proposed approach offers two key advantages over existing procedures: first, it expedites the estimation process for a large number of unique failure time points; second, it allows for straightforward integration and application of widely used regularized regression and screening methods. We illustrate the benefits of our proposed approach by conducting a comprehensive simulation study. Additionally, we showcase the utility of our procedure by estimating a survival model for the length of stay of patients hospitalized in the intensive care unit, considering three competing events: discharge to home, transfer to another medical facility, and in-hospital death.  ( 2 min )
    On semi-supervised estimation using exponential tilt mixture models. (arXiv:2311.08504v1 [stat.ML])
    Consider a semi-supervised setting with a labeled dataset of binary responses and predictors and an unlabeled dataset with only the predictors. Logistic regression is equivalent to an exponential tilt model in the labeled population. For semi-supervised estimation, we develop further analysis and understanding of a statistical approach using exponential tilt mixture (ETM) models and maximum nonparametric likelihood estimation, while allowing that the class proportions may differ between the unlabeled and labeled data. We derive asymptotic properties of ETM-based estimation and demonstrate improved efficiency over supervised logistic regression in a random sampling setup and an outcome-stratified sampling setup previously used. Moreover, we reconcile such efficiency improvement with the existing semiparametric efficiency theory when the class proportions in the unlabeled and labeled data are restricted to be the same. We also provide a simulation study to numerically illustrate our theoretical findings.  ( 2 min )
    Uplift Modeling based on Graph Neural Network Combined with Causal Knowledge. (arXiv:2311.08434v1 [cs.LG])
    Uplift modeling is a fundamental component of marketing effect modeling, which is commonly employed to evaluate the effects of treatments on outcomes. Through uplift modeling, we can identify the treatment with the greatest benefit. On the other side, we can identify clients who are likely to make favorable decisions in response to a certain treatment. In the past, uplift modeling approaches relied heavily on the difference-in-difference (DID) architecture, paired with a machine learning model as the estimation learner, while neglecting the link and confidential information between features. We proposed a framework based on graph neural networks that combine causal knowledge with an estimate of uplift value. Firstly, we presented a causal representation technique based on CATE (conditional average treatment effect) estimation and adjacency matrix structure learning. Secondly, we suggested a more scalable uplift modeling framework based on graph convolution networks for combining causal knowledge. Our findings demonstrate that this method works effectively for predicting uplift values, with small errors in typical simulated data, and its effectiveness has been verified in actual industry marketing data.  ( 2 min )
    Structured Estimation of Heterogeneous Time Series. (arXiv:2311.08658v1 [stat.ME])
    How best to model structurally heterogeneous processes is a foundational question in the social, health and behavioral sciences. Recently, Fisher et al., (2022) introduced the multi-VAR approach for simultaneously estimating multiple-subject multivariate time series characterized by common and individualizing features using penalized estimation. This approach differs from many popular modeling approaches for multiple-subject time series in that qualitative and quantitative differences in a large number of individual dynamics are well-accommodated. The current work extends the multi-VAR framework to include new adaptive weighting schemes that greatly improve estimation performance. In a small set of simulation studies we compare adaptive multi-VAR with these new penalty weights to common alternative estimators in terms of path recovery and bias. Furthermore, we provide toy examples and code demonstrating the utility of multi-VAR under different heterogeneity regimes using the multivar package for R (Fisher, 2022).  ( 2 min )
    Limitations of neural network training due to numerical instability of backpropagation. (arXiv:2210.00805v4 [cs.LG] UPDATED)
    We study the training of deep neural networks by gradient descent where floating-point arithmetic is used to compute the gradients. In this framework and under realistic assumptions, we demonstrate that it is highly unlikely to find ReLU neural networks that maintain, in the course of training with gradient descent, superlinearly many affine pieces with respect to their number of layers. In virtually all approximation theoretical arguments that yield high-order polynomial rates of approximation, sequences of ReLU neural networks with exponentially many affine pieces compared to their numbers of layers are used. As a consequence, we conclude that approximating sequences of ReLU neural networks resulting from gradient descent in practice differ substantially from theoretically constructed sequences. The assumptions and the theoretical results are compared to a numerical study, which yields concurring results.  ( 2 min )
    Model Agnostic Explainable Selective Regression via Uncertainty Estimation. (arXiv:2311.09145v1 [cs.LG])
    With the wide adoption of machine learning techniques, requirements have evolved beyond sheer high performance, often requiring models to be trustworthy. A common approach to increase the trustworthiness of such systems is to allow them to refrain from predicting. Such a framework is known as selective prediction. While selective prediction for classification tasks has been widely analyzed, the problem of selective regression is understudied. This paper presents a novel approach to selective regression that utilizes model-agnostic non-parametric uncertainty estimation. Our proposed framework showcases superior performance compared to state-of-the-art selective regressors, as demonstrated through comprehensive benchmarking on 69 datasets. Finally, we use explainable AI techniques to gain an understanding of the drivers behind selective regression. We implement our selective regression method in the open-source Python package doubt and release the code used to reproduce our experiments.  ( 2 min )

  • Open

    [P] Building the largest open-source prompt dataset, looking for contributors!
    Hi all. I am trying to build the largest open-source prompt dataset on GitHub. Plan is to split the prompts by industries, models and techniques. https://github.com/kortex-labs/awesome-llm-prompts I realize it's impossible to do this alone and therefore I am reaching out to y'all for help 🤝🥹 submitted by /u/kanxx030 [link] [comments]  ( 1 min )
    [P] Roast my code
    I recently implemented a Siamese network with triplet loss, incorporating online mining for triple selection. I'm seeking feedback on the model's performance and my code. The Flowers102 dataset was employed, aiming to determine whether two given flower images belong to the same species. ​ Despite achieving a 90% accuracy rate, I observed an unusual need for a low threshold (0.25) during inference. I attribute this to the inherent similarity between flower images, suggesting that a lower threshold is more appropriate. This contrasts with face verification tasks, which typically require higher thresholds. I would appreciate insights or suggestions regarding this observation. submitted by /u/Ok_Ad_1034 [link] [comments]  ( 1 min )
    [D] Open Source Simulator for NVIDIA A100 GPU Architecture
    Hi everyone! For my project, I am looking for an open-source GPU simulator supporting NVIDIA A100 GPU architecture. I would really appreciate it if you could point me to any relevant resources. Apologies if my question is off-topic. Considering the ML-GPU connection, I thought it might be relevant here. Thank you! submitted by /u/ramya_1995 [link] [comments]  ( 1 min )
    [D] What are some methods to combine Knowledge Graph and LLM?
    currently, i am working on a Knowledge Graph to improve LLM performance in the practice Level .. in my First Stage, I am wondering what is the better way to Process the Dataset as knowledge Graph,? - do I should use Prompt Engineering while representing KG as a Query schema? - may i could use RAG directly and set KG as a knowledge Database, my main question here is what are the most common techniques used to tokenize KG? or there's a possibility to use Top-Layer based on GNN to handle exrraction context from KG i am sorry if my Qs are not clear because i am just starting my journey in this problem submitted by /u/Youness_Elbrag [link] [comments]  ( 1 min )
    [D] Scaling object detection and classification networks (e.g. RetinaNet)
    Mostly state-of-the-art detection and classification networks (RetinaNet, EfficientDet) follow the same approach of applying CNN backbone and feeding last X layers to (Bi-)FPN. Similar backbone is used by Tesla in HydraNet (RegNet + BiFPN). ​ The question is the following: since many papers benchmark the networks on "low" resolution image databases, how would one approach scaling the network to higher res images? On the one hand, you could still apply the same architecture (resolution of each layer will be bigger though), but wouldn't the perceptive field of each feature be limited (relative to the image size)? Or would you train a new (deeper) CNN backbone for that? ​ Edit: to keep it simple, could I take for instance vanilla RetinaNet with pretrained weights from Pytorch and simply use it on 2-5MP images? Or the performance will drop, especially for objects spanning big part of field-of-view? submitted by /u/Spare-Help562 [link] [comments]  ( 1 min )
    [P] Structured EDA Commentary
    Hi All, I'm currently automating a data pipeline and was hoping to explore using ML for EDA commentary. I have structured data with headers and looking to generate top down granular commentary based on the data headers given in which they are either referencing unique string values and their corresponding numerical data header value. Was hoping anyone could point me in the right direction as to the structure or approach I could take as I'm still learning. Thanks. submitted by /u/One_Station_1762 [link] [comments]  ( 1 min )
    Hosting RL Challenges on Kaggle [D]
    Back in 2020, Kaggle released a new challenge category called "[Simulation competitions](https://www.kaggle.com/simulations)" allowing for reinforcement learning solutions. [ConnectX](https://www.kaggle.com/c/connectx) is perhaps the most popular example of this. I thought that since ConnectX allowed for submitting .py files instead of .csv, there should be a way of extending/modifying ConnectX to host my own reinforcement learning challenge for a class I am teaching. However, I could not find any way of hosting such a challenge myself using kaggle's platform. Does anyone here has any experience with this? submitted by /u/Brown_bagheera [link] [comments]  ( 1 min )
    [N] (In)validating published ML performance scores is possible
    I think many of us came across publications with ML performance scores (such as accuracy, sensitivity, etc.) that seemed unrealistic, still getting a lot of citations. We knew they were false, invalid, incorrect, maybe a typo, maybe some bug in the evaluation, maybe cheating, but no one invested the time to reimplement the method and prove it wrong. However, ML performance scores - especially if multiple ones are reported - cannot take any values independently. There are numerous constraints imposed on them by the experimental setup that they need to satisfy. We just released a Python package with the capability to identify inconsistencies regarding published ML performance scores given the description of the experimental setup. https://github.com/FalseNegativeLab/mlscorecheck In the corresponding paper we already identified some phishy things going on in the literature of skin lesion classification. What do you think? submitted by /u/gykovacs [link] [comments]
    [D] Bringing ML code to the data(base) to build better AI apps
    I built the open-source ML platform at Instacart a few years ago. I learned a ton, but primarily that it's better to bring your ML workload to the database rather than bringing the data to the code. It takes a lot of the complexity out of your infra, and it's ultimately faster for your users. That's why we made PostgresML. It's an open-source extension for PostgreSQL. Combine it with pgvector and you've got a complete ML platform with just a few extensions. We're bullish on the power of in-database and open-source ML/AI, both traditional tabular data applications, and novel NLP LLM RAG applications that combine retrieval and inference in a single process space to minimize latency and operational complexity. I'd love to get your thoughts on our approach. You can mess around with it on our site: https://postgresml.org/ Let me know what you think think. submitted by /u/something_cleverer [link] [comments]  ( 1 min )
    [D] Embedding Converter; need help with architecture
    I am an undergrad, and for a research project, I have been tasked with building an embedding converter. That is, given an input sentence (ie "This is a sentence."), I want to provide a model that will convert the embedding of this in 1 LLM to another. I have tried using a standard deterministic Neural Network, however, I believe that my accuracy is being limited by a few things. Embeddings seem to be generated probabilistically for most LLMs based on context, and so my deterministic architecture will inevitably fail. I've looked into building a VAE (Variational Autoencoder) to help this. Basically, given an input sentence, the encoder would convert 1 LLM embedding to another, and the decoder would reconstruct that original sentence. However, I'm having some difficulty understanding a few things: What would my latent space be for the encoder? What would my "converter" be? If it is the encoder, should my latent space then be in the size of my LLM2 embedding? This all seems a bit hand wavy to me, and I don't really know if this is a valid usage of a VAE. Do you recommend another way of dealing with this? Also, if anyone knows of any easily parseable large datasets (grammatically correct sentences that can be separated easily, ideally by line), let me know! I've looked into using the Pile, but I can't find the download link for a small part of their huge dataset. submitted by /u/sashajh12 [link] [comments]  ( 1 min )
    [D] Retraining the best model of every session
    Dear ML community, Im quite new to ML but doing this as part of my study and interest. I have experimented a bit with training models etc. but was confused about the following: Would it be a possibility to take the best performing model from a session, and retraining that one for your new session? Would this perform better or have at least a higher chance of performing better? Would it follow the exact same path? Would it work if you would feed it different data then? ​ I hope this isn't a dumb question! submitted by /u/tommenomnom [link] [comments]  ( 1 min )
    [R] Meta Announces Emu Edit: Precise Image Editing via Recognition and Generation Tasks
    Researchers at Meta AI announced Emu Edit today. It can edit images precisely based on text instructions. It's a big advance for "instructable" image editing. Existing systems struggle to interpret instructions correctly - making imprecise edits or changing the wrong parts of images. Emu Edit tackles this through multi-task training. They trained it on 16 diverse image editing and vision tasks like object removal, style transfer, segmentation etc. Emu Edit learns unique "task embeddings" to guide it towards suitable edits based on the instruction text. Like a "texture change" vs "object removal". In evaluations, Emu Edit significantly outperformed prior systems like InstructPix2Pix on following instructions faithfully while preserving unrelated image regions. With just a few examples, it can adapt to wholly new tasks like image inpainting by updating the task embedding rather than full retraining. There's still room for improvement on complex instructions. But Emu Edit demonstrates how multi-task training can majorly boost AI editing abilities. It's now much closer to human-level performance on translating natural language to precise visual edits. TLDR: Emu Edit uses multi-task training on diverse edits/vision tasks and task embeddings to achieve big improvements in instruction-based image editing fidelity. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 1 min )
    [D] Where does the 'ML' part of CAUSAL ML come in?
    Looks like classic causal inference, the type would see in Pearl or Econometrics. I don't understand where the ML comes in. submitted by /u/Ckdew [link] [comments]  ( 1 min )
    [N] Microsoft Releases SynapseML v1.0
    Today Microsoft announced the release and general availability of SynapseML v1.0 following seven years of continuous development. SynapseML is an open-source library that aims to streamline the development of massively scalable machine learning pipelines. It unifies several existing ML Frameworks and new Microsoft algorithms in a single, scalable API that is usable across Python, R, Scala, and Java. SynapseML is usable from any Apache Spark platform (or even your laptop) and is now generally available with enterprise support on Microsoft Fabric. To learn more: Release Notes: https://github.com/microsoft/SynapseML/releases/tag/v1.0.0 Website: https://aka.ms/spark Thank you to all the contributors in the community who made the release possible! ​ SynapseML v1.0 submitted by /u/mhamilton723 [link] [comments]  ( 1 min )
    Best model choice to finetune on new classes for object detection? [D]
    Let’s say I have trained my OD on the Coco dataset. Which choice of OD would be the best to finetune my model on a limited dataset of additional classes in a specific domain of images (Let’s say 1000 instances per class). Would Yolov8 be a good choice? Are there even any benchmarks for this? Should I just look for the highest AP on Coco/Pascal? submitted by /u/WingedTorch [link] [comments]  ( 1 min )
    [P] Sharing my notes on research papers/machine learning
    Hi everyone, I read research papers pretty often and take detailed notes to make sure I fully understand the paper. I was thinking that these notes would be useful for other people as well since I have found similar types of posts useful for myself. I just published my notes for MotionLM (Multi-Agent Motion Forecasting as Language Modeling) and would really appreciate some feedback on whether the notes are useful and worth continuing to share or if it's not that useful for other people. Here's the link to my notes: https://publish.obsidian.md/ericwiener/Research+Papers/Multi-Agent+Motion+Forecasting+as+Language+Modeling Thanks so much! Eric submitted by /u/EricW_CS [link] [comments]  ( 1 min )
    [D] A conceptual precursor to today's language machines
    https://hedgehogreview.com/issues/markets-and-the-good/artic... Some ~72 years ago in 1951, Claude Shannon released his Paper, "Prediction and Entropy of Printed English", an extremely fascinating read now. It begins with a game. Claude pulls a book down from the shelf, concealing the title in the process. After selecting a passage at random, he challenges his wife, Mary to guess its contents letter by letter. The space between words will count as a twenty-seventh symbol in the set. If Mary fails to guess a letter correctly, Claude promises to supply the right one so that the game can continue. In some cases, a corrected mistake allows her to fill in the remainder of the word; elsewhere a few letters unlock a phrase. All in all, she guesses 89 of 129 possible letters correctly—69 percen…  ( 1 min )
    [P] LoRAX: Open Source Framework for Serving 100s of Fine-Tuned LLMs in Production
    LoRAX (LoRA eXchange) is a framework that allows users to serve over a hundred fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency. Any thoughts? submitted by /u/jxz2101 [link] [comments]  ( 1 min )
    [D] Dragonfly ORS Image ROIs - Auto-Segmentation
    This is my first time posting on Reddit and I was unsure what would be the most appropriate subreddit to gauge, but the subreddit for the specific ORS I use is incredibly inactive, and its dedicated forums are not much better. I am working on Dragonfly ORS to segment long elements, specifically in the wing, of bats from micro-CT scans for my graduate thesis and am running into a problem with the sheer amount of time it takes to get through my dataset when manually segmenting each element (humerus, forearm,the thumb, digits 1-5 separated into metacarpal and individual phalanges, excluding carpals). I know there is a machine learning application to the ORS, but I am interested if anyone knows how to navigate any of the auto-segmentation aspects of the program. The point and click tool was what i was suggested to use, but anytime I attempt to use that it does not isolate elements to the the individual bones I am after (it will do an entire wings + scapula, for example, but I don't know how to break it down into the bones I need fully separated). I have tried to flank the ROI's and fill in the middle, but I am also unsure of how to ignore existing ROIs in order to fill in and combine to segment out an individual bone. Hopefully this inquiry makes sense. Essentially, I am seeing my options for how to segment out these bones into regions of interest in the fastest way possible that would allow more time just for clean-up as opposed to having to manually fill in (even with multi-slice) EVERY bone from scratch and then clean up. I don't know how popular this image tool is, but I would love any input!, even if just how to use Dragonfly for more effective segmentation! submitted by /u/SugarK_tastic [link] [comments]  ( 1 min )
    [Discussion] What are best practices when building/training very small models?
    I am currently trying to build small convolutional regression models with very tight constraints regarding model size (max. a few thousand parameters). Are there any rules of thumb/gold standards/best practices to consider here? E.g. should I prefer depth of the model over width, do skip connections add anything in these small scales, are there any special training hacks that might boost performance, etc? Any hints or pointers, where to look are greatly appreciated. submitted by /u/Snagnar [link] [comments]  ( 1 min )
    [P] My Physics Informed Neural Network repo in Github
    Hello everyone Feel free to view and review my repo, heat-pinn, in which I have solved a 2D steady state heat equation using ML on two different geometries. Cheers! https://github.com/314arhaam/heat-pinn/ submitted by /u/Professional-Ant9045 [link] [comments]  ( 1 min )
    Will machine learning be known as bad for the environment? [D]
    In my masters degree I always ran many computations as did all my peers The reality is that more of us are than not are using huge HPC clusters / cloud computing for many hours on each project The industry is just GPUs going BRRR I’m wondering if this has potential implications for ML in society as AI/ML becomes more mainstream I could see this narrative being easily played in legacy media Ps - yeah while there are researchers trying to make things more efficient, the general trend is that we are using more GPU hours per year in order to continue innovation at the forefront of artificial inference submitted by /u/Ok_Reality2341 [link] [comments]  ( 1 min )
    [D] How do you guys in the industry deal with data scarcity?
    I was recently trying to work on a ML project and realized that the I need very specific type of data for what I am trying to achieve which makes me wonder how do people and companies deal with data scarcity, I am pretty sure they at some point need very specific type of data which isn't easily available submitted by /u/BadKarma-18 [link] [comments]  ( 1 min )
    [D] Why are ML model outputs not tested regarding statistical significance?
    Often when I read ML papers the authors compare their results against a benchmark (e.g. using RMSE, accuracy, ...) and say "our results improved with our new method by X%". Nobody makes a significance test if the new method Y outperforms benchmark Z. Is there a reason why? Especially when you break your results down e.g. to the anaylsis of certain classes in object classification this seems important for me. Or do I overlook something? submitted by /u/Tigmib [link] [comments]  ( 1 min )
    [N] Trillion Parameter Consortium (TPC) - TPC Announced with Founding Partners
    Announcement: https://tpc.dev/2023/11/10/tpc-announced-with-founding-partners/ Participating Organizations: https://tpc.dev/participants/ ​ "The Trillion Parameter Consortium (TPC) brings together teams of researchers engaged in creating large-scale generative AI models to address key challenges in advancing AI for science. These challenges include developing scalable model architectures and training strategies, organizing, and curating scientific data for training models; optimizing AI libraries for current and future exascale computing platforms; and developing deep evaluation platforms to assess progress on scientific task learning and reliability and trust." ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [R] Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks - Microsoft 2023
    Paper: https://arxiv.org/abs/2311.06242 Abstract: We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities. https://preview.redd.it/hu9ouslxpl0c1.png?width=1217&format=png&auto=webp&s=06bd12f7b4ed22293523af3d65f9ed288f88e17a https://preview.redd.it/wc5pmolxpl0c1.png?width=1223&format=png&auto=webp&s=7d33386317a0fdd72f480e125bbd7a28aba73ad5 submitted by /u/APaperADay [link] [comments]  ( 9 min )
  • Open

    How to build a custom AI chat bot that encompasses Claude 2, Chat GPT, our own data, and other libraries.
    Hi, can anyone tell me tips, or send tutorial links as to how i can create an AI chatbot, in which I feed it documents, and have it use portions of Chat GPT, Claude 2, and other LLM/libraries? I am builiding this for myself so i can have an assistant that thinks in a particular way. ​ thanks in advance! submitted by /u/Popular_Ninja_6317 [link] [comments]
    Dreams of Electric Sheep
    I made some music using Suno.ai inspired by Philip K Dick. Here’s a short video I made to go along with it. I hope you enjoy! submitted by /u/Exitium_Maximus [link] [comments]
    Playing INWORLD Origins - NPC Generative AI Experience in 4K
    submitted by /u/erikmalkavian [link] [comments]
    Seeking a civic-minded Gen-AI expert interested in co-founding/advising a startup with a Harvard MBA student.
    Hi all! I'm currently a 2nd year student at Harvard Business School. I have a vision for a transformative Gen-AI application in local municipal governments the requires development of a private LLM. I'm planning to submit my idea to Harvard's New Venture Competition, but would love to bring on a technical co-founder or advisor to help roadmap the vision. Please shoot me a PM if interested! submitted by /u/thehurd03 [link] [comments]
    Screenshot to Code GPT
    submitted by /u/Senior_tasteey [link] [comments]
    Entertaining video on AI safety (in disguise as a video about wishes) and why it might not be such a good idea to be so dismissive of the dangers of AGIs.
    submitted by /u/IMightBeAHamster [link] [comments]
    Transforming the future of music creation
    submitted by /u/bartturner [link] [comments]
    Transforming the future of music creation
    submitted by /u/bartturner [link] [comments]
    Transforming the future of music creation
    submitted by /u/bartturner [link] [comments]
    Microsoft is finally making custom chips — and they’re all about AI
    submitted by /u/vjmde [link] [comments]
    Weird & Wild LLM Interactions
    Ever had an LLM conversation that left you scratching your head? Share the weirdest or wildest moments - cryptic convos, spill 'em! submitted by /u/redblade678 [link] [comments]
    Creative Collisions: LLMs and Human Imagination
    In a nutshell, do you think LLMs are sparking creativity or squashing it? Let's exchange lightning-fast thoughts on the collision of LLMs and human imagination. submitted by /u/redblade678 [link] [comments]
    Ethics Check: LLMs and the Slippery Slope
    What's your take on the ethical tightrope that comes with LLMs? Are we on a slippery slope? Share your quick thoughts! submitted by /u/redblade678 [link] [comments]
    Who will you buy dinner from?
    submitted by /u/Sea_Permit5660 [link] [comments]
    Share your favorite before-and-after examples using Topaz Labs' AI models
    Let's see the transformative power of AI in action! submitted by /u/cherishjoo [link] [comments]
    Will AI really take over the world one day or is it juts a phase?
    Do you think AI will really take over the world, or is it just a phase? We saw the same thing happening the 20th century when people imagined there would be flying cars just because it was trend back then. submitted by /u/redblade678 [link] [comments]
    What's the most weird output you've got on an LLM?
    Title submitted by /u/redblade678 [link] [comments]
    The Bias Conundrum: Are LLMs Playing Favorites
    Let's talk about the elephant in the room: bias. Critics argue that LLMs, despite their sophistication, can perpetuate and even amplify existing biases present in their training data. Are these models inadvertently favoring certain groups, and how does this impact the information they generate. submitted by /u/redblade678 [link] [comments]
    ASI really possible?
    Will ASI really be possible? beacuse it seems to me that they are just fancy word predictors, and to really achieve ASI we need to make more breakthroughs in AI. submitted by /u/redblade678 [link] [comments]
    One-Minute Daily AI News 11/15/2023
    Adobe is working on a new audio tool designed to break apart different layers of sound within a single recording. The tool is called Project Sound Lift, and it can use AI to separate elements like applause from the sound of someone’s voice.[1] Microsoft Offers AI-Powered Customer Service For Blind Users.[2] Microsoft and Google will not challenge an EU law requiring them to make it easier for users to move between competing services such as social media platforms and internet browsers.[3] OpenAI has temporarily stopped people from signing up to the premium version of ChatGPT, after it proved so popular the company was unable to operate it.[4] Sources: [1] https://www.theverge.com/2023/11/15/23962452/adobe-project-sound-lift-ai-audio-tool [2] https://www.bloomberg.com/news/articles/2023-11-15/microsoft-to-offer-ai-powered-customer-service-for-blind-users?embedded-checkout=true [3] https://www.reuters.com/technology/microsoft-google-not-challenge-eu-gatekeeper-designation-2023-11-14/ [4] https://uk.news.yahoo.com/chatgpt-plus-openai-stops-premium-175100037.html submitted by /u/Excellent-Target-847 [link] [comments]
    20,000+ Best Custom GPTs Directory
    submitted by /u/Senior_tasteey [link] [comments]
  • Open

    Problem with NN for binary image classification
    Hi, I’m trying to build a NN for binary image classification. I tried with augmentation, transfer learning and also fine tuning, but when I train my model I obtain good accuracy for both train (0.9) and validation (0.85) set, but a high loss function for the validation. Then when I try to use it on a test set, I get a pretty low accuracy (0.6). Do anyone have any advice? Thanks! submitted by /u/Sophia_Gi [link] [comments]  ( 9 min )
    In search of deepfake lip sync AI
    Hello, I would like to ask you if there are any programs like Wav2lip and VideoRetalking, only with good quality and good lip synchronization, I really need such a program, I've been looking for more than a month, unfortunately I haven't found it yet submitted by /u/Embarrassed_Gap_8032 [link] [comments]
  • Open

    Is Reinforcement Learning really used in industry? If so, is it comparable to other forms of NN?
    I'm thinking of specializing in RL while doing a PhD in environmental engineering (more specifically agriculture). The research, together with my interests, led me naturally to RL as a tool to solve problems and achieve interesting research results. But then I started wondering whether it's "worth it" to specialize in this since i intend to work in industry, rather than academia. Hence my question: Is RL really used in some applications in industry? Which ones? If it is, is it at least used comparably as much as supervised or unsupervised learning? Really I'm looking to understand as much as possible how is RL used in industry so whatever you can answer about that would be much appreciated. Thanks! submitted by /u/vniversvs_ [link] [comments]  ( 1 min )
    Reward Design for 1v1 Competition Environment
    I have a multi-agent competitive environment, let's call it "Monarch on the Hill". It is a 1v1 game where the Green Player has a stationary Monarch to defend, and the Blue Player is trying to take down the Monarch. Any collision between entities ends in a kill. Green Player's goal is ultimately to keep the Monarch alive on the hill (centered in the playing space). The action space for each agent is Multi-Discrete - they can accelerate either [-1,0,1] in both the X and Y directions up to a maximum speed of 5 in both directions (i.e., velocity = [-5,5]). There are not obstacles in the environment but there is Out of Bounds on all for sides of the playing space. Green Player wins if it kills Blue Player by colliding with them or if the Monarch is still alive at the end of the episode's time l…  ( 1 min )
    Adding Noise to observations Space in RL
    I'm currently in the process of building a feature selection framework for reinforcement learning. A. Here some previous works that I have seen : https://link.springer.com/chapter/10.1007/978-3-642-15880-3_36 https://ieeexplore.ieee.org/document/5381529 https://ieeexplore.ieee.org/document/4543621 https://www.researchgate.net/publication/221346040_An_analysis_of_linear_models_linear_value-function_approximation_and_feature_selection_for_reinforcement_learning https://www.researchgate.net/publication/221345743_Automatic_basis_function_construction_for_approximate_dynamic_programming_and_reinforcement_learning https://dl.acm.org/doi/10.1145/1273496.1273589 B. Context : --> I am experimenting with the non-continuous lunar lander environment using gym library. And I want to add tw…  ( 1 min )
    Minimal MuJoCo setup for getting started with IK and RL
    dear all, I'm trying to find a minimal setup like two STL files and a joint definition in MuJoCo to learn about Inverse Kinematics. I do have a full robot design as STLs, but I'd like to start small, add IK and then move on to RL. All the MuJoCo demo scenarios seem to have complex robots, but learning from scratch would be cool. Any hints? submitted by /u/assadollahi [link] [comments]  ( 1 min )
    Is there any benefit to using actor-critic methods in very small action space problems?
    So, my RL problem has a very small discrete action space, but big input - the environment is quite complex and only partially observable (so imperfect information). As I understand, the two big differences between value vs policy methods are: policy methods are better for large or continuous action spaces policy methods can do stochastic behaviour, necessary to dealng with imperfect information environment. I don't care about the first one, but I do about the second, therefore I went the policy method route. I did vanilla policy gradients, but they are, of course, unstable and slow to train. So I wanted to do PPO next. But reading more about the existing implementations of it, it seems to me everyone is using PPO in an actor-critic setting and not by itself. Which I'm open to adopting, but I can't help but think - "If i have a neural net that predicts value well, and my action space is like 4, why do I even need a policy then?". Actor-critic makes sense to me in large action spaces, but is there any benefit in small ones? And if not, what would be the better approach for problems with small action spaces but imperfect information? submitted by /u/allegory1100 [link] [comments]  ( 1 min )
    Built-in reinforcement learning functions in Python
    Is stablebaselines3 the best library for reinforcement learning functions in python? Are there better libraries that you would suggest? Any useful links? Thanks submitted by /u/MomoSolar [link] [comments]
  • Open

    AI Training AI: GatorTronGPT at the Forefront of University of Florida’s Medical AI Innovations
    How do you train an AI to understand clinical language with less clinical data? Train another AI to synthesize training data. Artificial intelligence is changing the way medicine is done, and is increasingly being used in all sorts of clinical tasks. This is fueled by generative AI and models like GatorTronGPT, a generative language model Read article >  ( 5 min )
    Three Ways Generative AI Can Bolster Cybersecurity
    Human analysts can no longer effectively defend against the increasing speed and complexity of cybersecurity attacks. The amount of data is simply too large to screen manually. Generative AI, the most transformative tool of our time, enables a kind of digital jiu jitsu. It lets companies shift the force of data that threatens to overwhelm Read article >  ( 6 min )
    Into the Omniverse: OpenUSD Enhancements for Autodesk Maya Make 3D Workflows a Ferret-Tale
    3D artists can improve the productivity and efficiency of their generative AI-enabled content-creation workflows thanks to the latest updates to popular OpenUSD software.  ( 7 min )
    More Games, More Wins: PC Game Pass Included With Six-Month GeForce NOW Memberships
    The fastest way to give the gift of cloud gaming starts this GFN Thursday: For a limited time, every six-month GeForce NOW Ultimate membership includes three months of PC Game Pass. Also, the newest GeForce NOW app update is rolling out to members, including Xbox Game Syncing and more improvements. Plus, take advantage of a Read article >  ( 7 min )
  • Open

    Responsible AI at Google Research: Adversarial testing for generative AI safety
    Posted by Kathy Meier-Hellstern, Building Responsible AI & Data Systems, Director, Google Research The Responsible AI and Human-Centered Technology (RAI-HCT) team within Google Research is committed to advancing the theory and practice of responsible human-centered AI through a lens of culturally-aware research, to meet the needs of billions of users today, and blaze the path forward for a better AI future. The BRAIDS (Building Responsible AI Data and Solutions) team within RAI-HCT aims to simplify the adoption of RAI practices through the utilization of scalable tools, high-quality data, streamlined processes, and novel research with a current emphasis on addressing the unique challenges posed by generative AI (GenAI). GenAI models have enabled unprecedented capabilities leading …  ( 92 min )
  • Open

    Philips accelerates development of AI-enabled healthcare solutions with an MLOps platform built on Amazon SageMaker
    This is a joint blog with AWS and Philips. Philips is a health technology company focused on improving people’s lives through meaningful innovation. Since 2014, the company has been offering customers its Philips HealthSuite Platform, which orchestrates dozens of AWS services that healthcare and life sciences companies use to improve patient care. It partners with […]  ( 15 min )
    Fine-tune Whisper models on Amazon SageMaker with LoRA
    Whisper is an Automatic Speech Recognition (ASR) model that has been trained using 680,000 hours of supervised data from the web, encompassing a range of languages and tasks. One of its limitations is the low-performance on low-resource languages such as Marathi language and Dravidian languages, which can be remediated with fine-tuning. However, fine-tuning a Whisper […]  ( 7 min )
    Use foundation models to improve model accuracy with Amazon SageMaker
    Determining the value of housing is a classic example of using machine learning (ML). In this post, we discuss the use of an open-source model specifically designed for the task of visual question answering (VQA). With VQA, you can ask a question of a photo using natural language and receive an answer to your question—also in plain language. Our goal in this post is to inspire and demonstrate what is possible using this technology.  ( 12 min )
  • Open

    What’s Your Story: Desney Tan
    From service in the Singapore Armed Forces to autonomous navigation with NASA & VR with Disney, Desney Tan’s life journey hasn’t been linear. Learn how Tan landed at Microsoft & about the purpose guiding his work in the podcast series “What’s Your Story”. The post What’s Your Story: Desney Tan appeared first on Microsoft Research.  ( 22 min )
  • Open

    Technique enables AI on edge devices to keep learning over time
    With the PockEngine training method, machine-learning models can efficiently and continuously learn from user data on edge devices like smartphones.  ( 10 min )
  • Open

    Activity Sparsity Complements Weight Sparsity for Efficient RNN Inference. (arXiv:2311.07625v1 [cs.LG])
    Artificial neural networks open up unprecedented machine learning capabilities at the cost of ever growing computational requirements. Sparsifying the parameters, often achieved through weight pruning, has been identified as a powerful technique to compress the number of model parameters and reduce the computational operations of neural networks. Yet, sparse activations, while omnipresent in both biological neural networks and deep learning systems, have not been fully utilized as a compression technique in deep learning. Moreover, the interaction between sparse activations and weight pruning is not fully understood. In this work, we demonstrate that activity sparsity can compose multiplicatively with parameter sparsity in a recurrent neural network model based on the GRU that is designed to be activity sparse. We achieve up to $20\times$ reduction of computation while maintaining perplexities below $60$ on the Penn Treebank language modeling task. This magnitude of reduction has not been achieved previously with solely sparsely connected LSTMs, and the language modeling performance of our model has not been achieved previously with any sparsely activated recurrent neural networks or spiking neural networks. Neuromorphic computing devices are especially good at taking advantage of the dynamic activity sparsity, and our results provide strong evidence that making deep learning models activity sparse and porting them to neuromorphic devices can be a viable strategy that does not compromise on task performance. Our results also drive further convergence of methods from deep learning and neuromorphic computing for efficient machine learning.  ( 3 min )
    A Data-Free Approach to Mitigate Catastrophic Forgetting in Federated Class Incremental Learning for Vision Tasks. (arXiv:2311.07784v1 [cs.LG])
    Deep learning models often suffer from forgetting previously learned information when trained on new data. This problem is exacerbated in federated learning (FL), where the data is distributed and can change independently for each user. Many solutions are proposed to resolve this catastrophic forgetting in a centralized setting. However, they do not apply directly to FL because of its unique complexities, such as privacy concerns and resource limitations. To overcome these challenges, this paper presents a framework for \textbf{federated class incremental learning} that utilizes a generative model to synthesize samples from past distributions. This data can be later exploited alongside the training data to mitigate catastrophic forgetting. To preserve privacy, the generative model is trained on the server using data-free methods at the end of each task without requesting data from clients. Moreover, our solution does not demand the users to store old data or models, which gives them the freedom to join/leave the training at any time. Additionally, we introduce SuperImageNet, a new regrouping of the ImageNet dataset specifically tailored for federated continual learning. We demonstrate significant improvements compared to existing baselines through extensive experiments on multiple datasets.  ( 2 min )
    CL-Flow:Strengthening the Normalizing Flows by Contrastive Learning for Better Anomaly Detection. (arXiv:2311.06794v1 [cs.IR] CROSS LISTED)
    In the anomaly detection field, the scarcity of anomalous samples has directed the current research emphasis towards unsupervised anomaly detection. While these unsupervised anomaly detection methods offer convenience, they also overlook the crucial prior information embedded within anomalous samples. Moreover, among numerous deep learning methods, supervised methods generally exhibit superior performance compared to unsupervised methods. Considering the reasons mentioned above, we propose a self-supervised anomaly detection approach that combines contrastive learning with 2D-Flow to achieve more precise detection outcomes and expedited inference processes. On one hand, we introduce a novel approach to anomaly synthesis, yielding anomalous samples in accordance with authentic industrial scenarios, alongside their surrogate annotations. On the other hand, having obtained a substantial number of anomalous samples, we enhance the 2D-Flow framework by incorporating contrastive learning, leveraging diverse proxy tasks to fine-tune the network. Our approach enables the network to learn more precise mapping relationships from self-generated labels while retaining the lightweight characteristics of the 2D-Flow. Compared to mainstream unsupervised approaches, our self-supervised method demonstrates superior detection accuracy, fewer additional model parameters, and faster inference speed. Furthermore, the entire training and inference process is end-to-end. Our approach showcases new state-of-the-art results, achieving a performance of 99.6\% in image-level AUROC on the MVTecAD dataset and 96.8\% in image-level AUROC on the BTAD dataset.  ( 2 min )
    Multi-Label Topic Model for Financial Textual Data. (arXiv:2311.07598v1 [q-fin.ST])
    This paper presents a multi-label topic model for financial texts like ad-hoc announcements, 8-K filings, finance related news or annual reports. I train the model on a new financial multi-label database consisting of 3,044 German ad-hoc announcements that are labeled manually using 20 predefined, economically motivated topics. The best model achieves a macro F1 score of more than 85%. Translating the data results in an English version of the model with similar performance. As application of the model, I investigate differences in stock market reactions across topics. I find evidence for strong positive or negative market reactions for some topics, like announcements of new Large Scale Projects or Bankruptcy Filings, while I do not observe significant price effects for some other topics. Furthermore, in contrast to previous studies, the multi-label structure of the model allows to analyze the effects of co-occurring topics on stock market reactions. For many cases, the reaction to a specific topic depends heavily on the co-occurrence with other topics. For example, if allocated capital from a Seasoned Equity Offering (SEO) is used for restructuring a company in the course of a Bankruptcy Proceeding, the market reacts positively on average. However, if that capital is used for covering unexpected, additional costs from the development of new drugs, the SEO implies negative reactions on average.  ( 2 min )
    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models. (arXiv:2311.07919v1 [eess.AS])
    Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.  ( 2 min )
    Memory-efficient Stochastic methods for Memory-based Transformers. (arXiv:2311.08123v1 [cs.LG])
    Training Memory-based transformers can require a large amount of memory and can be quite inefficient. We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers, which are often used for long-range context problems. For our experiments, we consider transformer-XL as our baseline model which is one of memorybased transformer models. We show that our resultant model, Skip Cross-head TransformerXL, outperforms the baseline on character level language modeling task with similar parameters and outperforms the baseline on word level language modelling task with almost 20% fewer parameters. Our proposed methods do not require any additional memory. We also demonstrate the effectiveness of our regularization mechanism on BERT which shows similar performance with reduction in standard deviation of scores of around 30% on multiple GLUE tasks.  ( 2 min )
    Data-Centric Financial Large Language Models. (arXiv:2310.17784v2 [cs.CL] UPDATED)
    Large language models (LLMs) show promise for natural language tasks but struggle when applied directly to complex domains like finance. LLMs have difficulty reasoning about and integrating all relevant information. We propose a data-centric approach to enable LLMs to better handle financial tasks. Our key insight is that rather than overloading the LLM with everything at once, it is more effective to preprocess and pre-understand the data. We create a financial LLM (FLLM) using multitask prompt-based finetuning to achieve data pre-processing and pre-understanding. However, labeled data is scarce for each task. To overcome manual annotation costs, we employ abductive augmentation reasoning (AAR) to automatically generate training data by modifying the pseudo labels from FLLM's own outputs. Experiments show our data-centric FLLM with AAR substantially outperforms baseline financial LLMs designed for raw text, achieving state-of-the-art on financial analysis and interpretation tasks. We also open source a new benchmark for financial analysis and interpretation. Our methodology provides a promising path to unlock LLMs' potential for complex real-world domains.  ( 2 min )
    HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. (arXiv:2306.15794v2 [cs.LG] UPDATED)
    Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena's new long-range capabilities, we present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level - an up to 500x increase over previous dense attention-based models. HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than Transformer), uses single nucleotide tokens, and has full global context at each layer. We explore what longer context enables - including the first use of in-context learning in genomics. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude less parameters and pretraining data. On the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets on average by +10 accuracy points. Code at https://github.com/HazyResearch/hyena-dna.  ( 3 min )
    High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise. (arXiv:2310.18784v2 [cs.LG] UPDATED)
    Several recent works have studied the convergence \textit{in high probability} of stochastic gradient descent (SGD) and its clipped variant. Compared to vanilla SGD, clipped SGD is practically more stable and has the additional theoretical benefit of logarithmic dependence on the failure probability. However, the convergence of other practical nonlinear variants of SGD, e.g., sign SGD, quantized SGD and normalized SGD, that achieve improved communication efficiency or accelerated convergence is much less understood. In this work, we study the convergence bounds \textit{in high probability} of a broad class of nonlinear SGD methods. For strongly convex loss functions with Lipschitz continuous gradients, we prove a logarithmic dependence on the failure probability, even when the noise is heavy-tailed. Strictly more general than the results for clipped SGD, our results hold for any nonlinearity with bounded (component-wise or joint) outputs, such as clipping, normalization, and quantization. Further, existing results with heavy-tailed noise assume bounded $\eta$-th central moments, with $\eta \in (1,2]$. In contrast, our refined analysis works even for $\eta=1$, strictly relaxing the noise moment assumptions in the literature.  ( 2 min )
    Gibbs-Based Information Criteria and the Over-Parameterized Regime. (arXiv:2306.05583v2 [cs.LG] UPDATED)
    Double-descent refers to the unexpected drop in test loss of a learning algorithm beyond an interpolating threshold with over-parameterization, which is not predicted by information criteria in their classical forms due to the limitations in the standard asymptotic approach. We update these analyses using the information risk minimization framework and provide Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) for models learned by the Gibbs algorithm. Notably, the penalty terms for the Gibbs-based AIC and BIC correspond to specific information measures, i.e., symmetrized KL information and KL divergence. We extend this information-theoretic analysis to over-parameterized models by providing two different Gibbs-based BICs to compute the marginal likelihood of random feature models in the regime where the number of parameters $p$ and the number of samples $n$ tend to infinity, with $p/n$ fixed. Our experiments demonstrate that the Gibbs-based BIC can select the high-dimensional model and reveal the mismatch between marginal likelihood and population risk in the over-parameterized regime, providing new insights to understand double-descent.  ( 2 min )
    Chip-Chat: Challenges and Opportunities in Conversational Hardware Design. (arXiv:2305.13243v2 [cs.LG] UPDATED)
    Modern hardware design starts with specifications provided in natural language. These are then translated by hardware engineers into appropriate Hardware Description Languages (HDLs) such as Verilog before synthesizing circuit elements. Automating this translation could reduce sources of human error from the engineering process. But, it is only recently that artificial intelligence (AI) has demonstrated capabilities for machine-based end-to-end design translations. Commercially-available instruction-tuned Large Language Models (LLMs) such as OpenAI's ChatGPT and Google's Bard claim to be able to produce code in a variety of programming languages; but studies examining them for hardware are still lacking. In this work, we thus explore the challenges faced and opportunities presented when leveraging these recent advances in LLMs for hardware design. Given that these `conversational' LLMs perform best when used interactively, we perform a case study where a hardware engineer co-architects a novel 8-bit accumulator-based microprocessor architecture with the LLM according to real-world hardware constraints. We then sent the processor to tapeout in a Skywater 130nm shuttle, meaning that this `Chip-Chat' resulted in what we believe to be the world's first wholly-AI-written HDL for tapeout.  ( 2 min )
    Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization. (arXiv:2306.09299v2 [cs.CL] UPDATED)
    A hallmark property of explainable AI models is the ability to teach other agents, communicating knowledge of how to perform a task. While Large Language Models perform complex reasoning by generating explanations for their predictions, it is unclear whether they also make good teachers for weaker agents. To address this, we consider a student-teacher framework between two LLM agents and study if, when, and how the teacher should intervene with natural language explanations to improve the student's performance. Since communication is expensive, we define a budget such that the teacher only communicates explanations for a fraction of the data, after which the student should perform well on its own. We decompose the teaching problem along four axes: (1) if teacher's test time intervention improve student predictions, (2) when it is worth explaining a data point, (3) how the teacher should personalize explanations to better teach the student, and (4) if teacher explanations also improve students on future unexplained data. We first show that teacher LLMs can indeed intervene on student reasoning to improve their performance. Next, inspired by the Theory of Mind abilities of effective teachers, we propose building two few-shot mental models of the student. The first model defines an Intervention Function that simulates the utility of an intervention, allowing the teacher to intervene when this utility is the highest and improving student performance at lower budgets. The second model enables the teacher to personalize explanations for a particular student and outperform unpersonalized teachers. We also demonstrate that in multi-turn interactions, teacher explanations generalize and learning from explained data improves student performance on future unexplained data. Finally, we verify that misaligned teachers can lower student performance to random chance by intentionally misleading them.  ( 3 min )
    Analyzing Transformer Dynamics as Movement through Embedding Space. (arXiv:2308.10874v2 [cs.LG] UPDATED)
    Transformer based language models exhibit intelligent behaviors such as understanding natural language, recognizing patterns, acquiring knowledge, reasoning, planning, reflecting and using tools. This paper explores how their underlying mechanics give rise to intelligent behaviors. Towards that end, we propose framing Transformer dynamics as movement through embedding space. Examining Transformers through this perspective reveals key insights, establishing a Theory of Transformers: 1) Intelligent behaviours map to paths in Embedding Space which, the Transformer random-walks through during inferencing. 2) LM training learns a probability distribution over all possible paths. `Intelligence' is learnt by assigning higher probabilities to paths representing intelligent behaviors. No learning can take place in-context; context only narrows the subset of paths sampled during decoding. 5) The Transformer is a self-mapping composition function, folding a context sequence into a context-vector such that it's proximity to a token-vector reflects its co-occurrence and conditioned probability. Thus, the physical arrangement of vectors in Embedding Space determines path probabilities. 6) Context vectors are composed by aggregating features of the sequence's tokens via a process we call the encoding walk. Attention contributes a - potentially redundant - association-bias to this process. 7) This process is comprised of two principal operation types: filtering (data independent) and aggregation (data dependent). This generalization unifies Transformers with other sequence models. Building upon this foundation, we formalize a popular semantic interpretation of embeddings into a ``concept-space theory'' and find some evidence of it's validity.  ( 3 min )
    Understanding Concept Identification as Consistent Data Clustering Across Multiple Feature Spaces. (arXiv:2301.05525v2 [cs.LG] UPDATED)
    Identifying meaningful concepts in large data sets can provide valuable insights into engineering design problems. Concept identification aims at identifying non-overlapping groups of design instances that are similar in a joint space of all features, but which are also similar when considering only subsets of features. These subsets usually comprise features that characterize a design with respect to one specific context, for example, constructive design parameters, performance values, or operation modes. It is desirable to evaluate the quality of design concepts by considering several of these feature subsets in isolation. In particular, meaningful concepts should not only identify dense, well separated groups of data instances, but also provide non-overlapping groups of data that persist when considering pre-defined feature subsets separately. In this work, we propose to view concept identification as a special form of clustering algorithm with a broad range of potential applications beyond engineering design. To illustrate the differences between concept identification and classical clustering algorithms, we apply a recently proposed concept identification algorithm to two synthetic data sets and show the differences in identified solutions. In addition, we introduce the mutual information measure as a metric to evaluate whether solutions return consistent clusters across relevant subsets. To support the novel understanding of concept identification, we consider a simulated data set from a decision-making problem in the energy management domain and show that the identified clusters are more interpretable with respect to relevant feature subsets than clusters found by common clustering algorithms and are thus more suitable to support a decision maker.  ( 3 min )
    Unsupervised Graph Deep Learning Reveals Emergent Flood Risk Profile of Urban Areas. (arXiv:2309.14610v3 [cs.LG] UPDATED)
    Urban flood risk emerges from complex and nonlinear interactions among multiple features related to flood hazard, flood exposure, and social and physical vulnerabilities, along with the complex spatial flood dependence relationships. Existing approaches for characterizing urban flood risk, however, are primarily based on flood plain maps, focusing on a limited number of features, primarily hazard and exposure features, without consideration of feature interactions or the dependence relationships among spatial areas. To address this gap, this study presents an integrated urban flood-risk rating model based on a novel unsupervised graph deep learning model (called FloodRisk-Net). FloodRisk-Net is capable of capturing spatial dependence among areas and complex and nonlinear interactions among flood hazards and urban features for specifying emergent flood risk. Using data from multiple metropolitan statistical areas (MSAs) in the United States, the model characterizes their flood risk into six distinct city-specific levels. The model is interpretable and enables feature analysis of areas within each flood-risk level, allowing for the identification of the three archetypes shaping the highest flood risk within each MSA. Flood risk is found to be spatially distributed in a hierarchical structure within each MSA, where the core city disproportionately bears the highest flood risk. Multiple cities are found to have high overall flood-risk levels and low spatial inequality, indicating limited options for balancing urban development and flood-risk reduction. Relevant flood-risk reduction strategies are discussed considering ways that the highest flood risk and uneven spatial distribution of flood risk are formed.  ( 3 min )
    Anomaly Detection in Industrial Machinery using IoT Devices and Machine Learning: a Systematic Mapping. (arXiv:2307.15807v2 [cs.LG] UPDATED)
    Anomaly detection is critical in the smart industry for preventing equipment failure, reducing downtime, and improving safety. Internet of Things (IoT) has enabled the collection of large volumes of data from industrial machinery, providing a rich source of information for Anomaly Detection. However, the volume and complexity of data generated by the Internet of Things ecosystems make it difficult for humans to detect anomalies manually. Machine learning (ML) algorithms can automate anomaly detection in industrial machinery by analyzing generated data. Besides, each technique has specific strengths and weaknesses based on the data nature and its corresponding systems. However, the current systematic mapping studies on Anomaly Detection primarily focus on addressing network and cybersecurity-related problems, with limited attention given to the industrial sector. Additionally, these studies do not cover the challenges involved in using ML for Anomaly Detection in industrial machinery within the context of the IoT ecosystems. This paper presents a systematic mapping study on Anomaly Detection for industrial machinery using IoT devices and ML algorithms to address this gap. The study comprehensively evaluates 84 relevant studies spanning from 2016 to 2023, providing an extensive review of Anomaly Detection research. Our findings identify the most commonly used algorithms, preprocessing techniques, and sensor types. Additionally, this review identifies application areas and points to future challenges and research opportunities.  ( 3 min )
    Learning Over Contracting and Lipschitz Closed-Loops for Partially-Observed Nonlinear Systems (Extended Version). (arXiv:2304.06193v2 [eess.SY] UPDATED)
    This paper presents a policy parameterization for learning-based control on nonlinear, partially-observed dynamical systems. The parameterization is based on a nonlinear version of the Youla parameterization and the recently proposed Recurrent Equilibrium Network (REN) class of models. We prove that the resulting Youla-REN parameterization automatically satisfies stability (contraction) and user-tunable robustness (Lipschitz) conditions on the closed-loop system. This means it can be used for safe learning-based control with no additional constraints or projections required to enforce stability or robustness. We test the new policy class in simulation on two reinforcement learning tasks: 1) magnetic suspension, and 2) inverting a rotary-arm pendulum. We find that the Youla-REN performs similarly to existing learning-based and optimal control methods while also ensuring stability and exhibiting improved robustness to adversarial disturbances.  ( 2 min )
    Homological Convolutional Neural Networks. (arXiv:2308.13816v2 [cs.LG] UPDATED)
    Deep learning methods have demonstrated outstanding performances on classification and regression tasks on homogeneous data types (e.g., image, audio, and text data). However, tabular data still pose a challenge, with classic machine learning approaches being often computationally cheaper and equally effective than increasingly complex deep learning architectures. The challenge arises from the fact that, in tabular data, the correlation among features is weaker than the one from spatial or semantic relationships in images or natural language, and the dependency structures need to be modeled without any prior information. In this work, we propose a novel deep learning architecture that exploits the data structural organization through topologically constrained network representations to gain relational information from sparse tabular inputs. The resulting model leverages the power of convolution and is centered on a limited number of concepts from network topology to guarantee: (i) a data-centric and deterministic building pipeline; (ii) a high level of interpretability over the inference process; and (iii) an adequate room for scalability. We test our model on 18 benchmark datasets against 5 classic machine learning and 3 deep learning models, demonstrating that our approach reaches state-of-the-art performances on these challenging datasets. The code to reproduce all our experiments is provided at https://github.com/FinancialComputingUCL/HomologicalCNN.  ( 2 min )
    Mixed Attention Network for Cross-domain Sequential Recommendation. (arXiv:2311.08272v1 [cs.IR])
    In modern recommender systems, sequential recommendation leverages chronological user behaviors to make effective next-item suggestions, which suffers from data sparsity issues, especially for new users. One promising line of work is the cross-domain recommendation, which trains models with data across multiple domains to improve the performance in data-scarce domains. Recent proposed cross-domain sequential recommendation models such as PiNet and DASL have a common drawback relying heavily on overlapped users in different domains, which limits their usage in practical recommender systems. In this paper, we propose a Mixed Attention Network (MAN) with local and global attention modules to extract the domain-specific and cross-domain information. Firstly, we propose a local/global encoding layer to capture the domain-specific/cross-domain sequential pattern. Then we propose a mixed attention layer with item similarity attention, sequence-fusion attention, and group-prototype attention to capture the local/global item similarity, fuse the local/global item sequence, and extract the user groups across different domains, respectively. Finally, we propose a local/global prediction layer to further evolve and combine the domain-specific and cross-domain interests. Experimental results on two real-world datasets (each with two domains) demonstrate the superiority of our proposed model. Further study also illustrates that our proposed method and components are model-agnostic and effective, respectively. The code and data are available at https://github.com/Guanyu-Lin/MAN.
    Federated Skewed Label Learning with Logits Fusion. (arXiv:2311.08202v1 [cs.LG])
    Federated learning (FL) aims to collaboratively train a shared model across multiple clients without transmitting their local data. Data heterogeneity is a critical challenge in realistic FL settings, as it causes significant performance deterioration due to discrepancies in optimization among local models. In this work, we focus on label distribution skew, a common scenario in data heterogeneity, where the data label categories are imbalanced on each client. To address this issue, we propose FedBalance, which corrects the optimization bias among local models by calibrating their logits. Specifically, we introduce an extra private weak learner on the client side, which forms an ensemble model with the local model. By fusing the logits of the two models, the private weak learner can capture the variance of different data, regardless of their category. Therefore, the optimization direction of local models can be improved by increasing the penalty for misclassifying minority classes and reducing the attention to majority classes, resulting in a better global model. Extensive experiments show that our method can gain 13\% higher average accuracy compared with state-of-the-art methods.
    A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation. (arXiv:2106.06854v2 [cs.LG] UPDATED)
    Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
    A Diffusion-based Method for Multi-turn Compositional Image Generation. (arXiv:2304.02192v2 [cs.CV] UPDATED)
    Multi-turn compositional image generation (M-CIG) is a challenging task that aims to iteratively manipulate a reference image given a modification text. While most of the existing methods for M-CIG are based on generative adversarial networks (GANs), recent advances in image generation have demonstrated the superiority of diffusion models over GANs. In this paper, we propose a diffusion-based method for M-CIG named conditional denoising diffusion with image compositional matching (CDD-ICM). We leverage CLIP as the backbone of image and text encoders, and incorporate a gated fusion mechanism, originally proposed for question answering, to compositionally fuse the reference image and the modification text at each turn of M-CIG. We introduce a conditioning scheme to generate the target image based on the fusion results. To prioritize the semantic quality of the generated target image, we learn an auxiliary image compositional match (ICM) objective, along with the conditional denoising diffusion (CDD) objective in a multi-task learning framework. Additionally, we also perform ICM guidance and classifier-free guidance to improve performance. Experimental results show that CDD-ICM achieves state-of-the-art results on two benchmark datasets for M-CIG, i.e., CoDraw and i-CLEVR.
    Communication-Constrained Bayesian Active Knowledge Distillation. (arXiv:2311.08053v1 [cs.LG])
    Consider an active learning setting in which a learner has a training set with few labeled examples and a pool set with many unlabeled inputs, while a remote teacher has a pre-trained model that is known to perform well for the learner's task. The learner actively transmits batches of unlabeled inputs to the teacher through a constrained communication channel for labeling. This paper addresses the following key questions: (i) Active batch selection: Which batch of inputs should be sent to the teacher to acquire the most useful information and thus reduce the number of required communication rounds? (ii) Batch encoding: How do we encode the batch of inputs for transmission to the teacher to reduce the communication resources required at each round? We introduce Communication-Constrained Bayesian Active Knowledge Distillation (CC-BAKD), a novel protocol that integrates Bayesian active learning with compression via a linear mix-up mechanism. Bayesian active learning selects the batch of inputs based on their epistemic uncertainty, addressing the "confirmation bias" that is known to increase the number of required communication rounds. Furthermore, the proposed mix-up compression strategy is integrated with the epistemic uncertainty-based active batch selection process to reduce the communication overhead per communication round.  ( 2 min )
    EGRC-Net: Embedding-induced Graph Refinement Clustering Network. (arXiv:2211.10627v2 [cs.LG] UPDATED)
    Existing graph clustering networks heavily rely on a predefined yet fixed graph, which can lead to failures when the initial graph fails to accurately capture the data topology structure of the embedding space. In order to address this issue, we propose a novel clustering network called Embedding-Induced Graph Refinement Clustering Network (EGRC-Net), which effectively utilizes the learned embedding to adaptively refine the initial graph and enhance the clustering performance. To begin, we leverage both semantic and topological information by employing a vanilla auto-encoder and a graph convolution network, respectively, to learn a latent feature representation. Subsequently, we utilize the local geometric structure within the feature embedding space to construct an adjacency matrix for the graph. This adjacency matrix is dynamically fused with the initial one using our proposed fusion architecture. To train the network in an unsupervised manner, we minimize the Jeffreys divergence between multiple derived distributions. Additionally, we introduce an improved approximate personalized propagation of neural predictions to replace the standard graph convolution network, enabling EGRC-Net to scale effectively. Through extensive experiments conducted on nine widely-used benchmark datasets, we demonstrate that our proposed methods consistently outperform several state-of-the-art approaches. Notably, EGRC-Net achieves an improvement of more than 11.99\% in Adjusted Rand Index (ARI) over the best baseline on the DBLP dataset. Furthermore, our scalable approach exhibits a 10.73% gain in ARI while reducing memory usage by 33.73% and decreasing running time by 19.71%. The code for EGRC-Net will be made publicly available at \url{https://github.com/ZhihaoPENG-CityU/EGRC-Net}.  ( 3 min )
    Neural Moving Horizon Estimation for Robust Flight Control. (arXiv:2206.10397v13 [cs.RO] UPDATED)
    Estimating and reacting to disturbances is crucial for robust flight control of quadrotors. Existing estimators typically require significant tuning for a specific flight scenario or training with extensive ground-truth disturbance data to achieve satisfactory performance. In this paper, we propose a neural moving horizon estimator (NeuroMHE) that can automatically tune its key parameters modeled by a neural network and adapt to different flight scenarios. We achieve this by deriving the analytical gradients of the MHE estimates with respect to the MHE weighting matrices, which enables a seamless embedding of the MHE as a learnable layer into the neural network for highly effective learning. Interestingly, we show that the gradients can be computed efficiently using a Kalman filter in a recursive form. Moreover, we develop a model-based policy gradient algorithm to train NeuroMHE directly from the quadrotor trajectory tracking error without needing the ground-truth disturbance data. The effectiveness of NeuroMHE is verified extensively via both simulations and physical experiments on quadrotors in various challenging flights. Notably, NeuroMHE outperforms a state-of-the-art neural network-based estimator, reducing force estimation errors by up to 76.7%, while using a portable neural network that has only 7.7% of the learnable parameters of the latter. The proposed method is general and can be applied to robust adaptive control of other robotic systems.  ( 3 min )
    Efficient Surrogate Models for Materials Science Simulations: Machine Learning-based Prediction of Microstructure Properties. (arXiv:2309.00305v2 [cs.LG] UPDATED)
    Determining, understanding, and predicting the so-called structure-property relation is an important task in many scientific disciplines, such as chemistry, biology, meteorology, physics, engineering, and materials science. Structure refers to the spatial distribution of, e.g., substances, material, or matter in general, while property is a resulting characteristic that usually depends in a non-trivial way on spatial details of the structure. Traditionally, forward simulations models have been used for such tasks. Recently, several machine learning algorithms have been applied in these scientific fields to enhance and accelerate simulation models or as surrogate models. In this work, we develop and investigate the applications of six machine learning techniques based on two different datasets from the domain of materials science: data from a two-dimensional Ising model for predicting the formation of magnetic domains and data representing the evolution of dual-phase microstructures from the Cahn-Hilliard model. We analyze the accuracy and robustness of all models and elucidate the reasons for the differences in their performances. The impact of including domain knowledge through tailored features is studied, and general recommendations based on the availability and quality of training data are derived from this.  ( 2 min )
    Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey. (arXiv:2311.07914v1 [cs.CL])
    The contemporary LLMs are prone to producing hallucinations, stemming mainly from the knowledge gaps within the models. To address this critical limitation, researchers employ diverse strategies to augment the LLMs by incorporating external knowledge, aiming to reduce hallucinations and enhance reasoning accuracy. Among these strategies, leveraging knowledge graphs as a source of external information has demonstrated promising results. In this survey, we conduct a comprehensive review of these knowledge-graph-based knowledge augmentation techniques in LLMs, focusing on their efficacy in mitigating hallucinations. We systematically categorize these methods into three overarching groups, offering both methodological comparisons and empirical evaluations of their performance. Lastly, the paper explores the challenges associated with these techniques and outlines potential avenues for future research in this emerging field.
    Diffusion-based generation of Histopathological Whole Slide Images at a Gigapixel scale. (arXiv:2311.08199v1 [eess.IV])
    We present a novel diffusion-based approach to generate synthetic histopathological Whole Slide Images (WSIs) at an unprecedented gigapixel scale. Synthetic WSIs have many potential applications: They can augment training datasets to enhance the performance of many computational pathology applications. They allow the creation of synthesized copies of datasets that can be shared without violating privacy regulations. Or they can facilitate learning representations of WSIs without requiring data annotations. Despite this variety of applications, no existing deep-learning-based method generates WSIs at their typically high resolutions. Mainly due to the high computational complexity. Therefore, we propose a novel coarse-to-fine sampling scheme to tackle image generation of high-resolution WSIs. In this scheme, we increase the resolution of an initial low-resolution image to a high-resolution WSI. Particularly, a diffusion model sequentially adds fine details to images and increases their resolution. In our experiments, we train our method with WSIs from the TCGA-BRCA dataset. Additionally to quantitative evaluations, we also performed a user study with pathologists. The study results suggest that our generated WSIs resemble the structure of real WSIs.
    Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain. (arXiv:2311.07766v1 [cs.CV])
    Integrating information from multiple modalities is arguably one of the essential prerequisites for grounding artificial intelligence systems with an understanding of the real world. Recent advances in video transformers that jointly learn from vision, text, and sound over time have made some progress toward this goal, but the degree to which these models integrate information from modalities still remains unclear. In this work, we present a promising approach for probing a pre-trained multimodal video transformer model by leveraging neuroscientific evidence of multimodal information processing in the brain. Using brain recordings of participants watching a popular TV show, we analyze the effects of multi-modal connections and interactions in a pre-trained multi-modal video transformer on the alignment with uni- and multi-modal brain regions. We find evidence that vision enhances masked prediction performance during language processing, providing support that cross-modal representations in models can benefit individual modalities. However, we don't find evidence of brain-relevant information captured by the joint multi-modal transformer representations beyond that captured by all of the individual modalities. We finally show that the brain alignment of the pre-trained joint representation can be improved by fine-tuning using a task that requires vision-language inferences. Overall, our results paint an optimistic picture of the ability of multi-modal transformers to integrate vision and language in partially brain-relevant ways but also show that improving the brain alignment of these models may require new approaches.  ( 3 min )
    Lite it fly: An All-Deformable-Butterfly Network. (arXiv:2311.08125v1 [cs.LG])
    Most deep neural networks (DNNs) consist fundamentally of convolutional and/or fully connected layers, wherein the linear transform can be cast as the product between a filter matrix and a data matrix obtained by arranging feature tensors into columns. The lately proposed deformable butterfly (DeBut) decomposes the filter matrix into generalized, butterflylike factors, thus achieving network compression orthogonal to the traditional ways of pruning or low-rank decomposition. This work reveals an intimate link between DeBut and a systematic hierarchy of depthwise and pointwise convolutions, which explains the empirically good performance of DeBut layers. By developing an automated DeBut chain generator, we show for the first time the viability of homogenizing a DNN into all DeBut layers, thus achieving an extreme sparsity and compression. Various examples and hardware benchmarks verify the advantages of All-DeBut networks. In particular, we show it is possible to compress a PointNet to < 5% parameters with < 5% accuracy drop, a record not achievable by other compression schemes.
    Fast swap regret minimization and applications to approximate correlated equilibria. (arXiv:2310.19647v2 [cs.GT] UPDATED)
    We give a simple and computationally efficient algorithm that, for any constant $\varepsilon>0$, obtains $\varepsilon T$-swap regret within only $T = \mathsf{polylog}(n)$ rounds; this is an exponential improvement compared to the super-linear number of rounds required by the state-of-the-art algorithm, and resolves the main open problem of [Blum and Mansour 2007]. Our algorithm has an exponential dependence on $\varepsilon$, but we prove a new, matching lower bound. Our algorithm for swap regret implies faster convergence to $\varepsilon$-Correlated Equilibrium ($\varepsilon$-CE) in several regimes: For normal form two-player games with $n$ actions, it implies the first uncoupled dynamics that converges to the set of $\varepsilon$-CE in polylogarithmic rounds; a $\mathsf{polylog}(n)$-bit communication protocol for $\varepsilon$-CE in two-player games (resolving an open problem mentioned by [Babichenko-Rubinstein'2017, Goos-Rubinstein'2018, Ganor-CS'2018]); and an $\tilde{O}(n)$-query algorithm for $\varepsilon$-CE (resolving an open problem of [Babichenko'2020] and obtaining the first separation between $\varepsilon$-CE and $\varepsilon$-Nash equilibrium in the query complexity model). For extensive-form games, our algorithm implies a PTAS for $\mathit{normal}$ $\mathit{form}$ $\mathit{correlated}$ $\mathit{equilibria}$, a solution concept often conjectured to be computationally intractable (e.g. [Stengel-Forges'08, Fujii'23]).
    A Simple Quantum Blockmodeling with Qubits and Permutations. (arXiv:2311.07726v1 [quant-ph])
    Blockmodeling of a given problem represented by an $N\times N$ adjacency matrix can be found by swapping rows and columns of the matrix (i.e. multiplying matrix from left and right by a permutation matrix). In general, through performing this task, row and column permutations affect the fitness value in optimization: For an $N\times N$ matrix, it requires $O(N)$ computations to find (or update) the fitness value of a candidate solution. On quantum computers, permutations can be applied in parallel and efficiently, and their implementations can be as simple as a single qubit operation (a NOT gate on a qubit) which takes an $O(1)$ time algorithmic step. In this paper, using permutation matrices, we describe a quantum blockmodeling for data analysis tasks. In the model, the measurement outcome of a small group of qubits are mapped to indicate the fitness value. Therefore, we show that it is possible to find or update the fitness value in $O(log(N))$ time. This lead us to show that when the number of iterations are less than $log(N)$ time, it may be possible to reach the same solution exponentially faster on quantum computers in comparison to classical computers. In addition, since on quantum circuits the different sequence of permutations can be applied in parallel (superpositon), the machine learning task in this model can be implemented more efficiently on quantum computers.  ( 2 min )
    In-context Learning and Gradient Descent Revisited. (arXiv:2311.07772v1 [cs.CL])
    In-context learning (ICL) has shown impressive results in few-shot learning tasks, yet its underlying mechanism is still not fully understood. Recent works suggest that ICL can be thought of as a gradient descent (GD) based optimization process. While promising, these results mainly focus on simplified settings of ICL and provide only a preliminary evaluation of the similarities between the two methods. In this work, we revisit the comparison between ICL and GD-based finetuning and study what properties of ICL an equivalent process must follow. We highlight a major difference in the flow of information between ICL and standard finetuning. Namely, ICL can only rely on information from lower layers at every point, while finetuning depends on loss gradients from deeper layers. We refer to this discrepancy as Layer Causality and show that a layer causal variant of the finetuning process aligns with ICL on par with vanilla finetuning and is even better in most cases across relevant metrics. To the best of our knowledge, this is the first work to discuss this discrepancy explicitly and suggest a solution that tackles this problem with minimal changes.  ( 2 min )
    Large Scale Prediction with Decision Trees. (arXiv:2104.13881v5 [stat.ML] UPDATED)
    This paper shows that decision trees constructed with Classification and Regression Trees (CART) and C4.5 methodology are consistent for regression and classification tasks, even when the number of predictor variables grows sub-exponentially with the sample size, under natural 0-norm and 1-norm sparsity constraints. The theory applies to a wide range of models, including (ordinary or logistic) additive regression models with component functions that are continuous, of bounded variation, or, more generally, Borel measurable. Consistency holds for arbitrary joint distributions of the predictor variables, thereby accommodating continuous, discrete, and/or dependent data. Finally, we show that these qualitative properties of individual trees are inherited by Breiman's random forests. A key step in the analysis is the establishment of an oracle inequality, which allows for a precise characterization of the goodness-of-fit and complexity tradeoff for a mis-specified model.
    Explainable History Distillation by Marked Temporal Point Process. (arXiv:2311.07797v1 [cs.LG])
    Explainability of machine learning models is mandatory when researchers introduce these commonly believed black boxes to real-world tasks, especially high-stakes ones. In this paper, we build a machine learning system to automatically generate explanations of happened events from history by \gls{ca} based on the \acrfull{tpp}. Specifically, we propose a new task called \acrfull{ehd}. This task requires a model to distill as few events as possible from observed history. The target is that the event distribution conditioned on left events predicts the observed future noticeably worse. We then regard distilled events as the explanation for the future. To efficiently solve \acrshort{ehd}, we rewrite the task into a \gls{01ip} and directly estimate the solution to the program by a model called \acrfull{model}. This work fills the gap between our task and existing works, which only spot the difference between factual and counterfactual worlds after applying a predefined modification to the environment. Experiment results on Retweet and StackOverflow datasets prove that \acrshort{model} significantly outperforms other \acrshort{ehd} baselines and can reveal the rationale underpinning real-world processes.  ( 2 min )
    Solving ARC visual analogies with neural embeddings and vector arithmetic: A generalized method. (arXiv:2311.08083v1 [cs.AI])
    Analogical reasoning derives information from known relations and generalizes this information to similar yet unfamiliar situations. One of the first generalized ways in which deep learning models were able to solve verbal analogies was through vector arithmetic of word embeddings, essentially relating words that were mapped to a vector space (e.g., king - man + woman = __?). In comparison, most attempts to solve visual analogies are still predominantly task-specific and less generalizable. This project focuses on visual analogical reasoning and applies the initial generalized mechanism used to solve verbal analogies to the visual realm. Taking the Abstraction and Reasoning Corpus (ARC) as an example to investigate visual analogy solving, we use a variational autoencoder (VAE) to transform ARC items into low-dimensional latent vectors, analogous to the word embeddings used in the verbal approaches. Through simple vector arithmetic, underlying rules of ARC items are discovered and used to solve them. Results indicate that the approach works well on simple items with fewer dimensions (i.e., few colors used, uniform shapes), similar input-to-output examples, and high reconstruction accuracy on the VAE. Predictions on more complex items showed stronger deviations from expected outputs, although, predictions still often approximated parts of the item's rule set. Error patterns indicated that the model works as intended. On the official ARC paradigm, the model achieved a score of 2% (cf. current world record is 21%) and on ConceptARC it scored 8.8%. Although the methodology proposed involves basic dimensionality reduction techniques and standard vector arithmetic, this approach demonstrates promising outcomes on ARC and can easily be generalized to other abstract visual reasoning tasks.  ( 3 min )
    Iterative missing value imputation based on feature importance. (arXiv:2311.08005v1 [cs.LG])
    Many datasets suffer from missing values due to various reasons,which not only increases the processing difficulty of related tasks but also reduces the accuracy of classification. To address this problem, the mainstream approach is to use missing value imputation to complete the dataset. Existing imputation methods estimate the missing parts based on the observed values in the original feature space, and they treat all features as equally important during data completion, while in fact different features have different importance. Therefore, we have designed an imputation method that considers feature importance. This algorithm iteratively performs matrix completion and feature importance learning, and specifically, matrix completion is based on a filling loss that incorporates feature importance. Our experimental analysis involves three types of datasets: synthetic datasets with different noisy features and missing values, real-world datasets with artificially generated missing values, and real-world datasets originally containing missing values. The results on these datasets consistently show that the proposed method outperforms the existing five imputation algorithms.To the best of our knowledge, this is the first work that considers feature importance in the imputation model.  ( 2 min )
    On the Lipschitz Constant of Deep Networks and Double Descent. (arXiv:2301.12309v4 [cs.LG] UPDATED)
    Existing bounds on the generalization error of deep networks assume some form of smooth or bounded dependence on the input variable, falling short of investigating the mechanisms controlling such factors in practice. In this work, we present an extensive experimental study of the empirical Lipschitz constant of deep networks undergoing double descent, and highlight non-monotonic trends strongly correlating with the test error. Building a connection between parameter-space and input-space gradients for SGD around a critical point, we isolate two important factors -- namely loss landscape curvature and distance of parameters from initialization -- respectively controlling optimization dynamics around a critical point and bounding model function complexity, even beyond the training data. Our study presents novels insights on implicit regularization via overparameterization, and effective model complexity for networks trained in practice.  ( 2 min )
    Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation. (arXiv:2311.07630v1 [cs.SD])
    Binaural stereo audio is recorded by imitating the way the human ear receives sound, which provides people with an immersive listening experience. Existing approaches leverage autoencoders and directly exploit visual spatial information to synthesize binaural stereo, resulting in a limited representation of visual guidance. For the first time, we propose a visually guided generative adversarial approach for generating binaural stereo audio from mono audio. Specifically, we develop a Stereo Audio Generation Model (SAGM), which utilizes shared spatio-temporal visual information to guide the generator and the discriminator to work separately. The shared visual information is updated alternately in the generative adversarial stage, allowing the generator and discriminator to deliver their respective guided knowledge while visually sharing. The proposed method learns bidirectional complementary visual information, which facilitates the expression of visual guidance in generation. In addition, spatial perception is a crucial attribute of binaural stereo audio, and thus the evaluation of stereo spatial perception is essential. However, previous metrics failed to measure the spatial perception of audio. To this end, a metric to measure the spatial perception of audio is proposed for the first time. The proposed metric is capable of measuring the magnitude and direction of spatial perception in the temporal dimension. Further, considering its function, it is feasible to utilize it instead of demanding user studies to some extent. The proposed method achieves state-of-the-art performance on 2 datasets and 5 evaluation metrics. Qualitative experiments and user studies demonstrate that the method generates space-realistic stereo audio.  ( 3 min )
    PEMS: Pre-trained Epidmic Time-series Models. (arXiv:2311.07841v1 [cs.LG])
    Providing accurate and reliable predictions about the future of an epidemic is an important problem for enabling informed public health decisions. Recent works have shown that leveraging data-driven solutions that utilize advances in deep learning methods to learn from past data of an epidemic often outperform traditional mechanistic models. However, in many cases, the past data is sparse and may not sufficiently capture the underlying dynamics. While there exists a large amount of data from past epidemics, leveraging prior knowledge from time-series data of other diseases is a non-trivial challenge. Motivated by the success of pre-trained models in language and vision tasks, we tackle the problem of pre-training epidemic time-series models to learn from multiple datasets from different diseases and epidemics. We introduce Pre-trained Epidemic Time-Series Models (PEMS) that learn from diverse time-series datasets of a variety of diseases by formulating pre-training as a set of self-supervised learning (SSL) tasks. We tackle various important challenges specific to pre-training for epidemic time-series such as dealing with heterogeneous dynamics and efficiently capturing useful patterns from multiple epidemic datasets by carefully designing the SSL tasks to learn important priors about the epidemic dynamics that can be leveraged for fine-tuning to multiple downstream tasks. The resultant PEM outperforms previous state-of-the-art methods in various downstream time-series tasks across datasets of varying seasonal patterns, geography, and mechanism of contagion including the novel Covid-19 pandemic unseen in pre-trained data with better efficiency using smaller fraction of datasets.  ( 3 min )
    Unsupervised Musical Object Discovery from Audio. (arXiv:2311.07534v2 [cs.SD] UPDATED)
    Current object-centric learning models such as the popular SlotAttention architecture allow for unsupervised visual scene decomposition. Our novel MusicSlots method adapts SlotAttention to the audio domain, to achieve unsupervised music decomposition. Since concepts of opacity and occlusion in vision have no auditory analogues, the softmax normalization of alpha masks in the decoders of visual object-centric models is not well-suited for decomposing audio objects. MusicSlots overcomes this problem. We introduce a spectrogram-based multi-object music dataset tailored to evaluate object-centric learning on western tonal music. MusicSlots achieves good performance on unsupervised note discovery and outperforms several established baselines on supervised note property prediction tasks.  ( 2 min )
    Two-Stage Predict+Optimize for Mixed Integer Linear Programs with Unknown Parameters in Constraints. (arXiv:2311.08022v1 [cs.AI])
    Consider the setting of constrained optimization, with some parameters unknown at solving time and requiring prediction from relevant features. Predict+Optimize is a recent framework for end-to-end training supervised learning models for such predictions, incorporating information about the optimization problem in the training process in order to yield better predictions in terms of the quality of the predicted solution under the true parameters. Almost all prior works have focused on the special case where the unknowns appear only in the optimization objective and not the constraints. Hu et al.~proposed the first adaptation of Predict+Optimize to handle unknowns appearing in constraints, but the framework has somewhat ad-hoc elements, and they provided a training algorithm only for covering and packing linear programs. In this work, we give a new \emph{simpler} and \emph{more powerful} framework called \emph{Two-Stage Predict+Optimize}, which we believe should be the canonical framework for the Predict+Optimize setting. We also give a training algorithm usable for all mixed integer linear programs, vastly generalizing the applicability of the framework. Experimental results demonstrate the superior prediction performance of our training framework over all classical and state-of-the-art methods.  ( 2 min )
    Modeling Complex Disease Trajectories using Deep Generative Models with Semi-Supervised Latent Processes. (arXiv:2311.08149v1 [cs.LG])
    In this paper, we propose a deep generative time series approach using latent temporal processes for modeling and holistically analyzing complex disease trajectories. We aim to find meaningful temporal latent representations of an underlying generative process that explain the observed disease trajectories in an interpretable and comprehensive way. To enhance the interpretability of these latent temporal processes, we develop a semi-supervised approach for disentangling the latent space using established medical concepts. By combining the generative approach with medical knowledge, we leverage the ability to discover novel aspects of the disease while integrating medical concepts into the model. We show that the learned temporal latent processes can be utilized for further data analysis and clinical hypothesis testing, including finding similar patients and clustering the disease into new sub-types. Moreover, our method enables personalized online monitoring and prediction of multivariate time series including uncertainty quantification. We demonstrate the effectiveness of our approach in modeling systemic sclerosis, showcasing the potential of our machine learning model to capture complex disease trajectories and acquire new medical knowledge.  ( 2 min )
    Balancing central and marginal rejection when combining independent significance tests. (arXiv:2310.16600v2 [stat.ME] UPDATED)
    A common approach to evaluating the significance of a collection of $p$-values combines them with a pooling function, in particular when the original data are not available. These pooled $p$-values convert a sample of $p$-values into a single number which behaves like a univariate $p$-value. To clarify discussion of these functions, a telescoping series of alternative hypotheses are introduced that communicate the strength and prevalence of non-null evidence in the $p$-values before general pooling formulae are discussed. A pattern noticed in the UMP pooled $p$-value for a particular alternative motivates the definition and discussion of central and marginal rejection levels at $\alpha$. It is proven that central rejection is always greater than or equal to marginal rejection, motivating a quotient to measure the balance between the two for pooled $p$-values. A combining function based on the $\chi^2_{\kappa}$ quantile transformation is proposed to control this quotient and shown to be robust to mis-specified parameters relative to the UMP. Different powers for different parameter settings motivate a map of plausible alternatives based on where this pooled $p$-value is minimized.  ( 2 min )
    Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications. (arXiv:2302.03683v2 [cs.LG] UPDATED)
    Partial monitoring is an expressive framework for sequential decision-making with an abundance of applications, including graph-structured and dueling bandits, dynamic pricing and transductive feedback models. We survey and extend recent results on the linear formulation of partial monitoring that naturally generalizes the standard linear bandit setting. The main result is that a single algorithm, information-directed sampling (IDS), is (nearly) worst-case rate optimal in all finite-action games. We present a simple and unified analysis of stochastic partial monitoring, and further extend the model to the contextual and kernelized setting.  ( 2 min )
    Statistical Parameterized Physics-Based Machine Learning Digital Twin Models for Laser Powder Bed Fusion Process. (arXiv:2311.07821v1 [cs.LG])
    A digital twin (DT) is a virtual representation of physical process, products and/or systems that requires a high-fidelity computational model for continuous update through the integration of sensor data and user input. In the context of laser powder bed fusion (LPBF) additive manufacturing, a digital twin of the manufacturing process can offer predictions for the produced parts, diagnostics for manufacturing defects, as well as control capabilities. This paper introduces a parameterized physics-based digital twin (PPB-DT) for the statistical predictions of LPBF metal additive manufacturing process. We accomplish this by creating a high-fidelity computational model that accurately represents the melt pool phenomena and subsequently calibrating and validating it through controlled experiments. In PPB-DT, a mechanistic reduced-order method-driven stochastic calibration process is introduced, which enables the statistical predictions of the melt pool geometries and the identification of defects such as lack-of-fusion porosity and surface roughness, specifically for diagnostic applications. Leveraging data derived from this physics-based model and experiments, we have trained a machine learning-based digital twin (PPB-ML-DT) model for predicting, monitoring, and controlling melt pool geometries. These proposed digital twin models can be employed for predictions, control, optimization, and quality assurance within the LPBF process, ultimately expediting product development and certification in LPBF-based metal additive manufacturing.  ( 3 min )
    Self-supervised Heterogeneous Graph Variational Autoencoders. (arXiv:2311.07929v1 [cs.LG])
    Heterogeneous Information Networks (HINs), which consist of various types of nodes and edges, have recently demonstrated excellent performance in graph mining. However, most existing heterogeneous graph neural networks (HGNNs) ignore the problems of missing attributes, inaccurate attributes and scarce labels for nodes, which limits their expressiveness. In this paper, we propose a generative self-supervised model SHAVA to address these issues simultaneously. Specifically, SHAVA first initializes all the nodes in the graph with a low-dimensional representation matrix. After that, based on the variational graph autoencoder framework, SHAVA learns both node-level and attribute-level embeddings in the encoder, which can provide fine-grained semantic information to construct node attributes. In the decoder, SHAVA reconstructs both links and attributes. Instead of directly reconstructing raw features for attributed nodes, SHAVA generates the initial low-dimensional representation matrix for all the nodes, based on which raw features of attributed nodes are further reconstructed to leverage accurate attributes. In this way, SHAVA can not only complete informative features for non-attributed nodes, but rectify inaccurate ones for attributed nodes. Finally, we conduct extensive experiments to show the superiority of SHAVA in tackling HINs with missing and inaccurate attributes.  ( 2 min )
    Self-Consistent Velocity Matching of Probability Flows. (arXiv:2301.13737v4 [cs.LG] UPDATED)
    We present a discretization-free scalable framework for solving a large class of mass-conserving partial differential equations (PDEs), including the time-dependent Fokker-Planck equation and the Wasserstein gradient flow. The main observation is that the time-varying velocity field of the PDE solution needs to be self-consistent: it must satisfy a fixed-point equation involving the probability flow characterized by the same velocity field. Instead of directly minimizing the residual of the fixed-point equation with neural parameterization, we use an iterative formulation with a biased gradient estimator that bypasses significant computational obstacles with strong empirical performance. Compared to existing approaches, our method does not suffer from temporal or spatial discretization, covers a wider range of PDEs, and scales to high dimensions. Experimentally, our method recovers analytical solutions accurately when they are available and achieves superior performance in high dimensions with less training time compared to alternatives.  ( 2 min )
    Missing Value Imputation for Multi-attribute Sensor Data Streams via Message Propagation (Extended Version). (arXiv:2311.07344v2 [cs.DB] UPDATED)
    Sensor data streams occur widely in various real-time applications in the context of the Internet of Things (IoT). However, sensor data streams feature missing values due to factors such as sensor failures, communication errors, or depleted batteries. Missing values can compromise the quality of real-time analytics tasks and downstream applications. Existing imputation methods either make strong assumptions about streams or have low efficiency. In this study, we aim to accurately and efficiently impute missing values in data streams that satisfy only general characteristics in order to benefit real-time applications more widely. First, we propose a message propagation imputation network (MPIN) that is able to recover the missing values of data instances in a time window. We give a theoretical analysis of why MPIN is effective. Second, we present a continuous imputation framework that consists of data update and model update mechanisms to enable MPIN to perform continuous imputation both effectively and efficiently. Extensive experiments on multiple real datasets show that MPIN can outperform the existing data imputers by wide margins and that the continuous imputation framework is efficient and accurate.
    A Rubik's Cube inspired approach to Clifford synthesis. (arXiv:2307.08684v2 [quant-ph] UPDATED)
    The problem of decomposing an arbitrary Clifford element into a sequence of Clifford gates is known as Clifford synthesis. Drawing inspiration from similarities between this and the famous Rubik's Cube problem, we develop a machine learning approach for Clifford synthesis based on learning an approximation to the distance to the identity. This approach is probabilistic and computationally intensive. However, when a decomposition is successfully found, it often involves fewer gates than existing synthesis algorithms. Additionally, our approach is much more flexible than existing algorithms in that arbitrary gate sets, device topologies, and gate fidelities may incorporated, thus allowing for the approach to be tailored to a specific device.
    Corruption-Robust Offline Reinforcement Learning with General Function Approximation. (arXiv:2310.14550v2 [cs.LG] UPDATED)
    We investigate the problem of corruption robustness in offline reinforcement learning (RL) with general function approximation, where an adversary can corrupt each sample in the offline dataset, and the corruption level $\zeta\geq0$ quantifies the cumulative corruption amount over $n$ episodes and $H$ steps. Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs). Drawing inspiration from the uncertainty-weighting technique from the robust online RL setting \citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty weight iteration procedure to efficiently compute on batched samples and propose a corruption-robust algorithm for offline RL. Notably, under the assumption of single policy coverage and the knowledge of $\zeta$, our proposed algorithm achieves a suboptimality bound that is worsened by an additive factor of $\mathcal O(\zeta \cdot (\text{CC}(\lambda,\hat{\mathcal F},\mathcal Z_n^H))^{1/2} (C(\hat{\mathcal F},\mu))^{-1/2} n^{-1})$ due to the corruption. Here $\text{CC}(\lambda,\hat{\mathcal F},\mathcal Z_n^H)$ is the coverage coefficient that depends on the regularization parameter $\lambda$, the confidence set $\hat{\mathcal F}$, and the dataset $\mathcal Z_n^H$, and $C(\hat{\mathcal F},\mu)$ is a coefficient that depends on $\hat{\mathcal F}$ and the underlying data distribution $\mu$. When specialized to linear MDPs, the corruption-dependent error term reduces to $\mathcal O(\zeta d n^{-1})$ with $d$ being the dimension of the feature map, which matches the existing lower bound for corrupted linear MDPs. This suggests that our analysis is tight in terms of the corruption-dependent term.
    Waving Goodbye to Low-Res: A Diffusion-Wavelet Approach for Image Super-Resolution. (arXiv:2304.01994v2 [cs.CV] CROSS LISTED)
    This paper presents a novel Diffusion-Wavelet (DiWa) approach for Single-Image Super-Resolution (SISR). It leverages the strengths of Denoising Diffusion Probabilistic Models (DDPMs) and Discrete Wavelet Transformation (DWT). By enabling DDPMs to operate in the DWT domain, our DDPM models effectively hallucinate high-frequency information for super-resolved images on the wavelet spectrum, resulting in high-quality and detailed reconstructions in image space. Quantitatively, we outperform state-of-the-art diffusion-based SISR methods, namely SR3 and SRDiff, regarding PSNR, SSIM, and LPIPS on both face (8x scaling) and general (4x scaling) SR benchmarks. Meanwhile, using DWT enabled us to use fewer parameters than the compared models: 92M parameters instead of 550M compared to SR3 and 9.3M instead of 12M compared to SRDiff. Additionally, our method outperforms other state-of-the-art generative methods on classical general SR datasets while saving inference time. Finally, our work highlights its potential for various applications.
    Robust and Scalable Hyperdimensional Computing With Brain-Like Neural Adaptations. (arXiv:2311.07705v1 [cs.LG])
    The Internet of Things (IoT) has facilitated many applications utilizing edge-based machine learning (ML) methods to analyze locally collected data. Unfortunately, popular ML algorithms often require intensive computations beyond the capabilities of today's IoT devices. Brain-inspired hyperdimensional computing (HDC) has been introduced to address this issue. However, existing HDCs use static encoders, requiring extremely high dimensionality and hundreds of training iterations to achieve reasonable accuracy. This results in a huge efficiency loss, severely impeding the application of HDCs in IoT systems. We observed that a main cause is that the encoding module of existing HDCs lacks the capability to utilize and adapt to information learned during training. In contrast, neurons in human brains dynamically regenerate all the time and provide more useful functionalities when learning new information. While the goal of HDC is to exploit the high-dimensionality of randomly generated base hypervectors to represent the information as a pattern of neural activity, it remains challenging for existing HDCs to support a similar behavior as brain neural regeneration. In this work, we present dynamic HDC learning frameworks that identify and regenerate undesired dimensions to provide adequate accuracy with significantly lowered dimensionalities, thereby accelerating both the training and inference.
    Detecting and Mitigating System-Level Anomalies of Vision-Based Controllers. (arXiv:2309.13475v2 [cs.RO] UPDATED)
    Autonomous systems, such as self-driving cars and drones, have made significant strides in recent years by leveraging visual inputs and machine learning for decision-making and control. Despite their impressive performance, these vision-based controllers can make erroneous predictions when faced with novel or out-of-distribution inputs. Such errors can cascade to catastrophic system failures and compromise system safety. In this work, we introduce a run-time anomaly monitor to detect and mitigate such closed-loop, system-level failures. Specifically, we leverage a reachability-based framework to stress-test the vision-based controller offline and mine its system-level failures. This data is then used to train a classifier that is leveraged online to flag inputs that might cause system breakdowns. The anomaly detector highlights issues that transcend individual modules and pertain to the safety of the overall system. We also design a fallback controller that robustly handles these detected anomalies to preserve system safety. We validate the proposed approach on an autonomous aircraft taxiing system that uses a vision-based controller for taxiing. Our results show the efficacy of the proposed approach in identifying and handling system-level anomalies, outperforming methods such as prediction error-based detection, and ensembling, thereby enhancing the overall safety and robustness of autonomous systems.
    Imaging through multimode fibres with physical prior. (arXiv:2311.03062v2 [physics.optics] UPDATED)
    Imaging through perturbed multimode fibres based on deep learning has been widely researched. However, existing methods mainly use target-speckle pairs in different configurations. It is challenging to reconstruct targets without trained networks. In this paper, we propose a physics-assisted, unsupervised, learning-based fibre imaging scheme. The role of the physical prior is to simplify the mapping relationship between the speckle pattern and the target image, thereby reducing the computational complexity. The unsupervised network learns target features according to the optimized direction provided by the physical prior. Therefore, the reconstruction process of the online learning only requires a few speckle patterns and unpaired targets. The proposed scheme also increases the generalization ability of the learning-based method in perturbed multimode fibres. Our scheme has the potential to extend the application of multimode fibre imaging.
    Causal Intervention for Measuring Confidence in Drug-Target Interaction Prediction. (arXiv:2306.00041v2 [q-bio.QM] UPDATED)
    Identifying and discovering drug-target interactions(DTIs) are vital steps in drug discovery and development. They play a crucial role in assisting scientists in finding new drugs and accelerating the drug development process. Recently, knowledge graph and knowledge graph embedding (KGE) models have made rapid advancements and demonstrated impressive performance in drug discovery. However, such models lack authenticity and accuracy in drug target identification, leading to an increased misjudgment rate and reduced drug development efficiency. To address these issues, we focus on the problem of drug-target interactions, with knowledge mapping as the core technology. Specifically, a causal intervention-based confidence measure is employed to assess the triplet score to improve the accuracy of the drug-target interaction prediction model. Experimental results demonstrate that the developed confidence measurement method based on causal intervention can significantly enhance the accuracy of DTI link prediction, particularly for high-precision models. The predicted results are more valuable in guiding the design and development of subsequent drug development experiments, thereby significantly improving the efficiency of drug development.
    A Consistent Diffusion-Based Algorithm for Semi-Supervised Graph Learning. (arXiv:2311.07627v1 [cs.LG])
    The task of semi-supervised classification aims at assigning labels to all nodes of a graph based on the labels known for a few nodes, called the seeds. One of the most popular algorithms relies on the principle of heat diffusion, where the labels of the seeds are spread by thermoconductance and the temperature of each node at equilibrium is used as a score function for each label. In this paper, we prove that this algorithm is not consistent unless the temperatures of the nodes at equilibrium are centered before scoring. This crucial step does not only make the algorithm provably consistent on a block model but brings significant performance gains on real graphs.
    A Better Match for Drivers and Riders: Reinforcement Learning at Lyft. (arXiv:2310.13810v2 [cs.LG] UPDATED)
    To better match drivers to riders in our ridesharing application, we revised Lyft's core matching algorithm. We use a novel online reinforcement learning approach that estimates the future earnings of drivers in real time and use this information to find more efficient matches. This change was the first documented implementation of a ridesharing matching algorithm that can learn and improve in real time. We evaluated the new approach during weeks of switchback experimentation in most Lyft markets, and estimated how it benefited drivers, riders, and the platform. In particular, it enabled our drivers to serve millions of additional riders each year, leading to more than $30 million per year in incremental revenue. Lyft rolled out the algorithm globally in 2021.
    Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure. (arXiv:2311.07590v1 [cs.CL])
    We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision. We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment. To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.
    Fuse to Forget: Bias Reduction and Selective Memorization through Model Fusion. (arXiv:2311.07682v1 [cs.CL])
    Model fusion research aims to aggregate the knowledge of multiple models to enhance performance by combining their weights. In this work, we study the inverse, investigating whether and how can model fusion interfere and reduce unwanted knowledge. We delve into the effects of model fusion on the evolution of learned shortcuts, social biases, and memorization capabilities in fine-tuned language models. Through several experiments covering text classification and generation tasks, our analysis highlights that shared knowledge among models is usually enhanced during model fusion, while unshared knowledge is usually lost or forgotten. Based on this observation, we demonstrate the potential of model fusion as a debiasing tool and showcase its efficacy in addressing privacy concerns associated with language models.
    Single-Model Attribution of Generative Models Through Final-Layer Inversion. (arXiv:2306.06210v3 [cs.CV] UPDATED)
    Recent breakthroughs in generative modeling have sparked interest in practical single-model attribution. Such methods predict whether a sample was generated by a specific generator or not, for instance, to prove intellectual property theft. However, previous works are either limited to the closed-world setting or require undesirable changes to the generative model. We address these shortcomings by, first, viewing single-model attribution through the lens of anomaly detection. Arising from this change of perspective, we propose FLIPAD, a new approach for single-model attribution in the open-world setting based on final-layer inversion and anomaly detection. We show that the utilized final-layer inversion can be reduced to a convex lasso optimization problem, making our approach theoretically sound and computationally efficient. The theoretical findings are accompanied by an experimental study demonstrating the effectiveness of our approach and its flexibility to various domains.
    GenTKG: Generative Forecasting on Temporal Knowledge Graph. (arXiv:2310.07793v2 [cs.CL] UPDATED)
    The rapid advancements in large language models (LLMs) have ignited interest in the temporal knowledge graph (tKG) domain, where conventional carefully designed embedding-based and rule-based models dominate. The question remains open of whether pre-trained LLMs can understand structured temporal relational data and replace them as the foundation model for temporal relational forecasting. Therefore, we bring temporal knowledge forecasting into the generative setting. However, challenges occur in the huge chasms between complex temporal graph data structure and sequential natural expressions LLMs can handle, and between the enormous data sizes of tKGs and heavy computation costs of finetuning LLMs. To address these challenges, we propose a novel retrieval augmented generation framework that performs generative forecasting on tKGs named GenTKG, which combines a temporal logical rule-based retrieval strategy and lightweight parameter-efficient instruction tuning. Extensive experiments have shown that GenTKG outperforms conventional methods of temporal relational forecasting under low computation resources. GenTKG also highlights remarkable transferability with exceeding performance on unseen datasets without re-training. Our work reveals the huge potential of LLMs in the tKG domain and opens a new frontier for generative forecasting on tKGs.
    Simultaneous Clutter Detection and Semantic Segmentation of Moving Objects for Automotive Radar Data. (arXiv:2311.07247v2 [cs.CV] UPDATED)
    The unique properties of radar sensors, such as their robustness to adverse weather conditions, make them an important part of the environment perception system of autonomous vehicles. One of the first steps during the processing of radar point clouds is often the detection of clutter, i.e. erroneous points that do not correspond to real objects. Another common objective is the semantic segmentation of moving road users. These two problems are handled strictly separate from each other in literature. The employed neural networks are always focused entirely on only one of the tasks. In contrast to this, we examine ways to solve both tasks at the same time with a single jointly used model. In addition to a new augmented multi-head architecture, we also devise a method to represent a network's predictions for the two tasks with only one output value. This novel approach allows us to solve the tasks simultaneously with the same inference time as a conventional task-specific model. In an extensive evaluation, we show that our setup is highly effective and outperforms every existing network for semantic segmentation on the RadarScenes dataset.
    Towards a robust and reliable deep learning approach for detection of compact binary mergers in gravitational wave data. (arXiv:2306.11797v2 [gr-qc] UPDATED)
    The ability of deep learning (DL) approaches to learn generalised signal and noise models, coupled with their fast inference on GPUs, holds great promise for enhancing gravitational-wave (GW) searches in terms of speed, parameter space coverage, and search sensitivity. However, the opaque nature of DL models severely harms their reliability. In this work, we meticulously develop a DL model stage-wise and work towards improving its robustness and reliability. First, we address the problems in maintaining the purity of training data by deriving a new metric that better reflects the visual strength of the 'chirp' signal features in the data. Using a reduced, smooth representation obtained through a variational auto-encoder (VAE), we build a classifier to search for compact binary coalescence (CBC) signals. Our tests on real LIGO data show an impressive performance of the model. However, upon probing the robustness of the model through adversarial attacks, its simple failure modes were identified, underlining how such models can still be highly fragile. As a first step towards bringing robustness, we retrain the model in a novel framework involving a generative adversarial network (GAN). Over the course of training, the model learns to eliminate the primary modes of failure identified by the adversaries. Although absolute robustness is practically impossible to achieve, we demonstrate some fundamental improvements earned through such training, like sparseness and reduced degeneracy in the extracted features at different layers inside the model. We show that these gains are achieved at practically zero loss in terms of model performance on real LIGO data before and after GAN training. Through a direct search on 8.8 days of LIGO data, we recover two significant CBC events from GWTC-2.1, GW190519_153544 and GW190521_074359. We also report the search sensitivity obtained from an injection study.
    Tanimoto Random Features for Scalable Molecular Machine Learning. (arXiv:2306.14809v2 [cs.LG] UPDATED)
    The Tanimoto coefficient is commonly used to measure the similarity between molecules represented as discrete fingerprints, either as a distance metric or a positive definite kernel. While many kernel methods can be accelerated using random feature approximations, at present there is a lack of such approximations for the Tanimoto kernel. In this paper we propose two kinds of novel random features to allow this kernel to scale to large datasets, and in the process discover a novel extension of the kernel to real-valued vectors. We theoretically characterize these random features, and provide error bounds on the spectral norm of the Gram matrix. Experimentally, we show that these random features are effective at approximating the Tanimoto coefficient of real-world datasets and are useful for molecular property prediction and optimization tasks.
    Improving Diffusion Models for ECG Imputation with an Augmented Template Prior. (arXiv:2310.15742v2 [cs.LG] UPDATED)
    Pulsative signals such as the electrocardiogram (ECG) are extensively collected as part of routine clinical care. However, noisy and poor-quality recordings are a major issue for signals collected using mobile health systems, decreasing the signal quality, leading to missing values, and affecting automated downstream tasks. Recent studies have explored the imputation of missing values in ECG with probabilistic time-series models. Nevertheless, in comparison with the deterministic models, their performance is still limited, as the variations across subjects and heart-beat relationships are not explicitly considered in the training objective. In this work, to improve the imputation and forecasting accuracy for ECG with probabilistic models, we present a template-guided denoising diffusion probabilistic model (DDPM), PulseDiff, which is conditioned on an informative prior for a range of health conditions. Specifically, 1) we first extract a subject-level pulsative template from the observed values to use as an informative prior of the missing values, which personalises the prior; 2) we then add beat-level stochastic shift terms to augment the prior, which considers variations in the position and amplitude of the prior at each beat; 3) we finally design a confidence score to consider the health condition of the subject, which ensures our prior is provided safely. Experiments with the PTBXL dataset reveal that PulseDiff improves the performance of two strong DDPM baseline models, CSDI and SSSD$^{S4}$, verifying that our method guides the generation of DDPMs while managing the uncertainty. When combined with SSSD$^{S4}$, PulseDiff outperforms the leading deterministic model for short-interval missing data and is comparable for long-interval data loss.
    Modeling Choice via Self-Attention. (arXiv:2311.07607v1 [cs.AI])
    Models of choice are a fundamental input to many now-canonical optimization problems in the field of Operations Management, including assortment, inventory, and price optimization. Naturally, accurate estimation of these models from data is a critical step in the application of these optimization problems in practice, and so it is perhaps surprising that such choice estimation has to now been accomplished almost exclusively, both in theory and in practice, (a) without the use of deep learning in any meaningful way, and (b) via evaluation on limited data with constantly-changing metrics. This is in stark contrast to the vast majority of similar learning applications, for which the practice of machine learning suggests that (a) neural network-based models are typically state-of-the-art, and (b) strict standardization on evaluation procedures (datasets, metrics, etc.) is crucial. Thus motivated, we first propose a choice model that is the first to successfully (both theoretically and practically) leverage a modern neural network architectural concept (self-attention). Theoretically, we show that our attention-based choice model is a low-rank generalization of the Halo Multinomial Logit model, a recent model that parsimoniously captures irrational choice effects and has seen empirical success. We prove that whereas the Halo-MNL requires $\Omega(m^2)$ data samples to estimate, where $m$ is the number of products, our model supports a natural nonconvex estimator (in particular, that which a standard neural network implementation would apply) which admits a near-optimal stationary point with $O(m)$ samples. We then establish the first realistic-scale benchmark for choice estimation on real data and use this benchmark to run the largest evaluation of existing choice models to date. We find that the model we propose is dominant over both short-term and long-term data periods.
    AutoOptLib: Tailoring Metaheuristic Optimizers via Automated Algorithm Design. (arXiv:2303.06536v2 [cs.NE] UPDATED)
    Metaheuristics are prominent gradient-free optimizers for solving hard problems that do not meet the rigorous mathematical assumptions of analytical solvers. The canonical manual optimizer design could be laborious, untraceable and error-prone, let alone human experts are not always available. This arises increasing interest and demand in automating the optimizer design process. In response, this paper proposes AutoOptLib, the first platform for accessible automated design of metaheuristic optimizers. AutoOptLib leverages computing resources to conceive, build up, and verify the design choices of the optimizers. It requires much less labor resources and expertise than manual design, democratizing satisfactory metaheuristic optimizers to a much broader range of researchers and practitioners. Furthermore, by fully exploring the design choices with computing resources, AutoOptLib has the potential to surpass human experience, subsequently gaining enhanced performance compared with human problem-solving. To realize the automated design, AutoOptLib provides 1) a rich library of metaheuristic components for continuous, discrete, and permutation problems; 2) a flexible algorithm representation for evolving diverse algorithm structures; 3) different design objectives and techniques for different optimization scenarios; and 4) a graphic user interface for accessibility and practicability. AutoOptLib is fully written in Matlab/Octave; its source code and documentation are available at https://github.com/qz89/AutoOpt and https://AutoOpt.readthedocs.io/, respectively.
    FinGPT: Democratizing Internet-scale Data for Financial Large Language Models. (arXiv:2307.10485v2 [cs.CL] UPDATED)
    Large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like texts, which may potentially revolutionize the finance industry. However, existing LLMs often fall short in the financial field, which is mainly attributed to the disparities between general text data and financial text data. Unfortunately, there is only a limited number of financial text datasets available, and BloombergGPT, the first financial LLM (FinLLM), is close-sourced (only the training logs were released). In light of this, we aim to democratize Internet-scale financial data for LLMs, which is an open challenge due to diverse data sources, low signal-to-noise ratio, and high time-validity. To address the challenges, we introduce an open-sourced and data-centric framework, Financial Generative Pre-trained Transformer (FinGPT), that automates the collection and curation of real-time financial data from 34 diverse sources on the Internet, providing researchers and practitioners with accessible and transparent resources to develop their FinLLMs. Additionally, we propose a simple yet effective strategy for fine-tuning FinLLM using the inherent feedback from the market, dubbed Reinforcement Learning with Stock Prices (RLSP). We also adopt the Low-rank Adaptation (LoRA, QLoRA) method that enables users to customize their own FinLLMs from general-purpose LLMs at a low cost. Finally, we showcase several FinGPT applications, including robo-advisor, sentiment analysis for algorithmic trading, and low-code development. FinGPT aims to democratize FinLLMs, stimulate innovation, and unlock new opportunities in open finance. The codes have been open-sourced.
    PAC-Bayesian Generalization Bounds for Adversarial Generative Models. (arXiv:2302.08942v4 [cs.LG] UPDATED)
    We extend PAC-Bayesian theory to generative models and develop generalization bounds for models based on the Wasserstein distance and the total variation distance. Our first result on the Wasserstein distance assumes the instance space is bounded, while our second result takes advantage of dimensionality reduction. Our results naturally apply to Wasserstein GANs and Energy-Based GANs, and our bounds provide new training objectives for these two. Although our work is mainly theoretical, we perform numerical experiments showing non-vacuous generalization bounds for Wasserstein GANs on synthetic datasets.
    Empowering Many, Biasing a Few: Generalist Credit Scoring through Large Language Models. (arXiv:2310.00566v2 [cs.LG] UPDATED)
    In the financial industry, credit scoring is a fundamental element, shaping access to credit and determining the terms of loans for individuals and businesses alike. Traditional credit scoring methods, however, often grapple with challenges such as narrow knowledge scope and isolated evaluation of credit tasks. Our work posits that Large Language Models (LLMs) have great potential for credit scoring tasks, with strong generalization ability across multiple tasks. To systematically explore LLMs for credit scoring, we propose the first open-source comprehensive framework. We curate a novel benchmark covering 9 datasets with 14K samples, tailored for credit assessment and a critical examination of potential biases within LLMs, and the novel instruction tuning data with over 45k samples. We then propose the first Credit and Risk Assessment Large Language Model (CALM) by instruction tuning, tailored to the nuanced demands of various financial risk assessment tasks. We evaluate CALM, and existing state-of-art (SOTA) open source and close source LLMs on the build benchmark. Our empirical results illuminate the capability of LLMs to not only match but surpass conventional models, pointing towards a future where credit scoring can be more inclusive, comprehensive, and unbiased. We contribute to the industry's transformation by sharing our pioneering instruction-tuning datasets, credit and risk assessment LLM, and benchmarks with the research community and the financial industry.
    A Metacognitive Approach to Out-of-Distribution Detection for Segmentation. (arXiv:2311.07578v1 [cs.CV])
    Despite outstanding semantic scene segmentation in closed-worlds, deep neural networks segment novel instances poorly, which is required for autonomous agents acting in an open world. To improve out-of-distribution (OOD) detection for segmentation, we introduce a metacognitive approach in the form of a lightweight module that leverages entropy measures, segmentation predictions, and spatial context to characterize the segmentation model's uncertainty and detect pixel-wise OOD data in real-time. Additionally, our approach incorporates a novel method of generating synthetic OOD data in context with in-distribution data, which we use to fine-tune existing segmentation models with maximum entropy training. This further improves the metacognitive module's performance without requiring access to OOD data while enabling compatibility with established pre-trained models. Our resulting approach can reliably detect OOD instances in a scene, as shown by state-of-the-art performance on OOD detection for semantic segmentation benchmarks.
    REST: Retrieval-Based Speculative Decoding. (arXiv:2311.08252v1 [cs.CL])
    We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain common phases and patterns. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens. This method draws from the reservoir of existing knowledge, retrieving and employing relevant tokens based on the current context. Its plug-and-play nature allows for seamless integration and acceleration of any language models, all without necessitating additional training. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on code or text generation. The code of REST is available at https://github.com/FasterDecoding/REST.
    Simplifying and Understanding State Space Models with Diagonal Linear RNNs. (arXiv:2212.00768v3 [cs.LG] UPDATED)
    Sequence models based on linear state spaces (SSMs) have recently emerged as a promising choice of architecture for modeling long range dependencies across various modalities. However, they invariably rely on discretization of a continuous state space, which complicates their presentation and understanding. In this work, we dispose of the discretization step, and propose a model based on vanilla Diagonal Linear RNNs ($\mathrm{DLR}$). We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously-proposed SSMs on a variety of tasks and benchmarks including Long Range Arena and raw speech classification. Moreover, we characterize the expressivity of SSMs (including $\mathrm{DLR}$) and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks involving interactions over tens of thousands of tokens, ranging from simple operations, such as shifting an input sequence, to detecting co-dependent visual features over long spatial ranges in flattened images. We find that while SSMs report near-perfect performance on tasks that can be modeled via $\textit{few}$ convolutional kernels, they struggle on tasks requiring $\textit{many}$ such kernels and especially when the desired sequence manipulation is $\textit{context-dependent}$. Despite these limitations, $\mathrm{DLR}$ reaches high performance on two higher-order reasoning tasks $\mathrm{ListOpsSubTrees}$ and $\mathrm{PathfinderSegmentation}\text{-}\mathrm{256}$ with input lengths $8K$ and $65K$ respectively, and gives encouraging performance on $\mathrm{PathfinderSegmentation}\text{-}\mathrm{512}$ with input length $262K$ for which attention is not a viable choice.
    Generalized partitioned local depth. (arXiv:2303.10167v4 [stat.ML] UPDATED)
    In this paper we provide a generalization of the concept of cohesion as introduced recently by Berenhaut, Moore and Melvin [Proceedings of the National Academy of Sciences, 119 (4) (2022)]. The formulation presented builds on the technique of partitioned local depth by distilling two key probabilistic concepts: local relevance and support division. Earlier results are extended within the new context, and examples of applications to revealing communities in data with uncertainty are included. The work sheds light on the foundations of partitioned local depth, and extends the original ideas to enable probabilistic consideration of uncertain, variable and potentially conflicting information.
    Review of compressed embedding layers and their applications for recommender systems. (arXiv:2306.13724v2 [cs.LG] UPDATED)
    We review the literature on trainable, compressed embedding layers and discuss their applicability for compressing gigantic neural recommender systems. We also report the results we measured with our compressed embedding layers.
    ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection. (arXiv:2307.00754v2 [cs.LG] UPDATED)
    Anomaly detection in multivariate time series data is of paramount importance for ensuring the efficient operation of large-scale systems across diverse domains. However, accurately detecting anomalies in such data poses significant challenges. Existing approaches, including forecasting and reconstruction-based methods, struggle to address these challenges effectively. To overcome these limitations, we propose a novel anomaly detection framework named ImDiffusion, which combines time series imputation and diffusion models to achieve accurate and robust anomaly detection. The imputation-based approach employed by ImDiffusion leverages the information from neighboring values in the time series, enabling precise modeling of temporal and inter-correlated dependencies, reducing uncertainty in the data, thereby enhancing the robustness of the anomaly detection process. ImDiffusion further leverages diffusion models as time series imputers to accurately capturing complex dependencies. We leverage the step-by-step denoised outputs generated during the inference process to serve as valuable signals for anomaly prediction, resulting in improved accuracy and robustness of the detection process. We evaluate the performance of ImDiffusion via extensive experiments on benchmark datasets. The results demonstrate that our proposed framework significantly outperforms state-of-the-art approaches in terms of detection accuracy and timeliness. ImDiffusion is further integrated into the real production system in Microsoft and observe a remarkable 11.4% increase in detection F1 score compared to the legacy approach. To the best of our knowledge, ImDiffusion represents a pioneering approach that combines imputation-based techniques with time series anomaly detection, while introducing the novel use of diffusion models to the field.
    An efficient semi-supervised quality control system trained using physics-based MRI-artefact generators and adversarial training. (arXiv:2206.03359v2 [eess.IV] UPDATED)
    Large medical imaging data sets are becoming increasingly available, but ensuring sample quality without significant artefacts is challenging. Existing methods for identifying imperfections in medical imaging rely on data-intensive approaches, compounded by a scarcity of artefact-rich scans for training machine learning models in clinical research. To tackle this problem, we propose a framework with four main components: 1) artefact generators inspired by magnetic resonance physics to corrupt brain MRI scans and augment a training dataset, 2) abstract and engineered features to represent images compactly, 3) a feature selection process depending on the artefact class to improve classification, and 4) SVM classifiers to identify artefacts. Our contributions are threefold: first, physics-based artefact generators produce synthetic brain MRI scans with controlled artefacts for data augmentation. This will avoid the labour-intensive collection and labelling process of scans with rare artefacts. Second, we propose a pool of abstract and engineered image features to identify 9 different artefacts for structural MRI. Finally, we use an artefact-based feature selection block that, for each class of artefacts, finds the set of features providing the best classification performance. We performed validation experiments on a large data set of scans with artificially-generated artefacts, and in a multiple sclerosis clinical trial where real artefacts were identified by experts, showing that the proposed pipeline outperforms traditional methods. In particular, our data augmentation increases performance by up to 12.5 percentage points on accuracy, precision, and recall. The computational efficiency of our pipeline enables potential real-time deployment, promising high-throughput clinical applications through automated image-processing pipelines driven by quality control systems.
    The convergence of the Stochastic Gradient Descent (SGD) : a self-contained proof. (arXiv:2103.14350v2 [stat.ML] UPDATED)
    We give here a proof of the convergence of the Stochastic Gradient Descent (SGD) in a self-contained manner.
    Improved Beam Search for Hallucination Mitigation in Abstractive Summarization. (arXiv:2212.02712v2 [cs.CL] UPDATED)
    Advancement in large pretrained language models has significantly improved their performance for conditional language generation tasks including summarization albeit with hallucinations. To reduce hallucinations, conventional methods proposed improving beam search or using a fact checker as a postprocessing step. In this paper, we investigate the use of the Natural Language Inference (NLI) entailment metric to detect and prevent hallucinations in summary generation. We propose an NLI-assisted beam re-ranking mechanism by computing entailment probability scores between the input context and summarization model-generated beams during saliency-enhanced greedy decoding. Moreover, a diversity metric is introduced to compare its effectiveness against vanilla beam search. Our proposed algorithm significantly outperforms vanilla beam decoding on XSum and CNN/DM datasets.
    Visualizing the Diversity of Representations Learned by Bayesian Neural Networks. (arXiv:2201.10859v2 [cs.LG] UPDATED)
    Explainable Artificial Intelligence (XAI) aims to make learning machines less opaque, and offers researchers and practitioners various tools to reveal the decision-making strategies of neural networks. In this work, we investigate how XAI methods can be used for exploring and visualizing the diversity of feature representations learned by Bayesian Neural Networks (BNNs). Our goal is to provide a global understanding of BNNs by making their decision-making strategies a) visible and tangible through feature visualizations and b) quantitatively measurable with a distance measure learned by contrastive learning. Our work provides new insights into the \emph{posterior} distribution in terms of human-understandable feature information with regard to the underlying decision making strategies. The main findings of our work are the following: 1) global XAI methods can be applied to explain the diversity of decision-making strategies of BNN instances, 2) Monte Carlo dropout with commonly used Dropout rates exhibit increased diversity in feature representations compared to the multimodal posterior approximation of MultiSWAG, 3) the diversity of learned feature representations highly correlates with the uncertainty estimate for the output and 4) the inter-mode diversity of the multimodal posterior decreases as the network width increases, while the intra mode diversity increases. These findings are consistent with the recent Deep Neural Networks theory, providing additional intuitions about what the theory implies in terms of humanly understandable concepts.
    Mobility-Induced Graph Learning for WiFi Positioning. (arXiv:2311.08271v1 [cs.LG])
    A smartphone-based user mobility tracking could be effective in finding his/her location, while the unpredictable error therein due to low specification of built-in inertial measurement units (IMUs) rejects its standalone usage but demands the integration to another positioning technique like WiFi positioning. This paper aims to propose a novel integration technique using a graph neural network called Mobility-INduced Graph LEarning (MINGLE), which is designed based on two types of graphs made by capturing different user mobility features. Specifically, considering sequential measurement points (MPs) as nodes, a user's regular mobility pattern allows us to connect neighbor MPs as edges, called time-driven mobility graph (TMG). Second, a user's relatively straight transition at a constant pace when moving from one position to another can be captured by connecting the nodes on each path, called a direction-driven mobility graph (DMG). Then, we can design graph convolution network (GCN)-based cross-graph learning, where two different GCN models for TMG and DMG are jointly trained by feeding different input features created by WiFi RTTs yet sharing their weights. Besides, the loss function includes a mobility regularization term such that the differences between adjacent location estimates should be less variant due to the user's stable moving pace. Noting that the regularization term does not require ground-truth location, MINGLE can be designed under semi- and self-supervised learning frameworks. The proposed MINGLE's effectiveness is extensively verified through field experiments, showing a better positioning accuracy than benchmarks, say root mean square errors (RMSEs) being 1.398 (m) and 1.073 (m) for self- and semi-supervised learning cases, respectively.
    Inferring Causal Effects Under Heterogeneous Peer Influence. (arXiv:2305.17479v2 [cs.SI] UPDATED)
    Causal inference in networks should account for interference, which occurs when a unit's outcome is influenced by treatments or outcomes of peers. Heterogeneous peer influence (HPI) occurs when a unit's outcome is influenced differently by different peers based on their attributes and relationships, or when each unit has a different susceptibility to peer influence. Existing solutions to estimating direct causal effects under interference consider either homogeneous influence from peers or specific heterogeneous influence mechanisms (e.g., based on local neighborhood structure). This paper presents a methodology for estimating individual direct causal effects in the presence of HPI where the mechanism of influence is not known a priori. We propose a structural causal model for networks that can capture different possible assumptions about network structure, interference conditions, and causal dependence and enables reasoning about identifiability in the presence of HPI. We find potential heterogeneous contexts using the causal model and propose a novel graph neural network-based estimator to estimate individual direct causal effects. We show that state-of-the-art methods for individual direct effect estimation produce biased results in the presence of HPI, and that our proposed estimator is robust.
    Understanding learning from EEG data: Combining machine learning and feature engineering based on hidden Markov models and mixed models. (arXiv:2311.08113v1 [q-bio.QM])
    Theta oscillations, ranging from 4-8 Hz, play a significant role in spatial learning and memory functions during navigation tasks. Frontal theta oscillations are thought to play an important role in spatial navigation and memory. Electroencephalography (EEG) datasets are very complex, making any changes in the neural signal related to behaviour difficult to interpret. However, multiple analytical methods are available to examine complex data structure, especially machine learning based techniques. These methods have shown high classification performance and the combination with feature engineering enhances the capability of these methods. This paper proposes using hidden Markov and linear mixed effects models to extract features from EEG data. Based on the engineered features obtained from frontal theta EEG data during a spatial navigation task in two key trials (first, last) and between two conditions (learner and non-learner), we analysed the performance of six machine learning methods (Polynomial Support Vector Machines, Non-linear Support Vector Machines, Random Forests, K-Nearest Neighbours, Ridge, and Deep Neural Networks) on classifying learner and non-learner participants. We also analysed how different standardisation methods used to pre-process the EEG data contribute to classification performance. We compared the classification performance of each trial with data gathered from the same subjects, including solely coordinate-based features, such as idle time and average speed. We found that more machine learning methods perform better classification using coordinate-based data. However, only deep neural networks achieved an area under the ROC curve higher than 80% using the theta EEG data alone. Our findings suggest that standardising the theta EEG data and using deep neural networks enhances the classification of learner and non-learner subjects in a spatial learning task.
    Neural Lattice Reduction: A Self-Supervised Geometric Deep Learning Approach. (arXiv:2311.08170v1 [cs.LG])
    Lattice reduction is a combinatorial optimization problem aimed at finding the most orthogonal basis in a given lattice. In this work, we address lattice reduction via deep learning methods. We design a deep neural model outputting factorized unimodular matrices and train it in a self-supervised manner by penalizing non-orthogonal lattice bases. We incorporate the symmetries of lattice reduction into the model by making it invariant and equivariant with respect to appropriate continuous and discrete groups.
    Diffused Redundancy in Pre-trained Representations. (arXiv:2306.00183v3 [cs.LG] UPDATED)
    Representations learned by pre-training a neural network on a large dataset are increasingly used successfully to perform a variety of downstream tasks. In this work, we take a closer look at how features are encoded in such pre-trained representations. We find that learned representations in a given layer exhibit a degree of diffuse redundancy, ie, any randomly chosen subset of neurons in the layer that is larger than a threshold size shares a large degree of similarity with the full layer and is able to perform similarly as the whole layer on a variety of downstream tasks. For example, a linear probe trained on $20\%$ of randomly picked neurons from the penultimate layer of a ResNet50 pre-trained on ImageNet1k achieves an accuracy within $5\%$ of a linear probe trained on the full layer of neurons for downstream CIFAR10 classification. We conduct experiments on different neural architectures (including CNNs and Transformers) pre-trained on both ImageNet1k and ImageNet21k and evaluate a variety of downstream tasks taken from the VTAB benchmark. We find that the loss and dataset used during pre-training largely govern the degree of diffuse redundancy and the "critical mass" of neurons needed often depends on the downstream task, suggesting that there is a task-inherent redundancy-performance Pareto frontier. Our findings shed light on the nature of representations learned by pre-trained deep neural networks and suggest that entire layers might not be necessary to perform many downstream tasks. We investigate the potential for exploiting this redundancy to achieve efficient generalization for downstream tasks and also draw caution to certain possible unintended consequences. Our code is available at \url{https://github.com/nvedant07/diffused-redundancy}.
    AuthentiGPT: Detecting Machine-Generated Text via Black-Box Language Models Denoising. (arXiv:2311.07700v1 [cs.CL])
    Large language models (LLMs) have opened up enormous opportunities while simultaneously posing ethical dilemmas. One of the major concerns is their ability to create text that closely mimics human writing, which can lead to potential misuse, such as academic misconduct, disinformation, and fraud. To address this problem, we present AuthentiGPT, an efficient classifier that distinguishes between machine-generated and human-written texts. Under the assumption that human-written text resides outside the distribution of machine-generated text, AuthentiGPT leverages a black-box LLM to denoise input text with artificially added noise, and then semantically compares the denoised text with the original to determine if the content is machine-generated. With only one trainable parameter, AuthentiGPT eliminates the need for a large training dataset, watermarking the LLM's output, or computing the log-likelihood. Importantly, the detection capability of AuthentiGPT can be easily adapted to any generative language model. With a 0.918 AUROC score on a domain-specific dataset, AuthentiGPT demonstrates its effectiveness over other commercial algorithms, highlighting its potential for detecting machine-generated text in academic settings.
    Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning. (arXiv:2311.08182v1 [cs.CL])
    Enhancing the instruction-following ability of Large Language Models (LLMs) primarily demands substantial instruction-tuning datasets. However, the sheer volume of these imposes a considerable computational burden and annotation cost. To investigate a label-efficient instruction tuning method that allows the model itself to actively sample subsets that are equally or even more effective, we introduce a self-evolving mechanism DiverseEvol. In this process, a model iteratively augments its training subset to refine its own performance, without requiring any intervention from humans or more advanced LLMs. The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets, as the model selects new data points most distinct from any existing ones according to its current embedding space. Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol. Our models, trained on less than 8% of the original dataset, maintain or improve performance compared with finetuning on full data. We also provide empirical evidence to analyze the importance of diversity in instruction data and the iterative scheme as opposed to one-time sampling. Our code is publicly available at https://github.com/OFA-Sys/DiverseEvol.git.
    Winner Takes It All: Training Performant RL Populations for Combinatorial Optimization. (arXiv:2210.03475v2 [cs.AI] UPDATED)
    Applying reinforcement learning (RL) to combinatorial optimization problems is attractive as it removes the need for expert knowledge or pre-solved instances. However, it is unrealistic to expect an agent to solve these (often NP-)hard problems in a single shot at inference due to their inherent complexity. Thus, leading approaches often implement additional search strategies, from stochastic sampling and beam search to explicit fine-tuning. In this paper, we argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference. To this end, we introduce Poppy, a simple training procedure for populations. Instead of relying on a predefined or hand-crafted notion of diversity, Poppy induces an unsupervised specialization targeted solely at maximizing the performance of the population. We show that Poppy produces a set of complementary policies, and obtains state-of-the-art RL results on four popular NP-hard problems: traveling salesman, capacitated vehicle routing, 0-1 knapsack, and job-shop scheduling.
    How good are Large Language Models on African Languages?. (arXiv:2311.07978v1 [cs.CL])
    Recent advancements in natural language processing have led to the proliferation of large language models (LLMs). These models have been shown to yield good performance, using in-context learning, even on unseen tasks and languages. Additionally, they have been widely adopted as language-model-as-a-service commercial APIs like GPT-4 API. However, their performance on African languages is largely unknown. We present an analysis of three popular large language models (mT0, LLaMa 2, and GPT-4) on five tasks (news topic classification, sentiment classification, machine translation, question answering, and named entity recognition) across 30 African languages, spanning different language families and geographical regions. Our results suggest that all LLMs produce below-par performance on African languages, and there is a large gap in performance compared to high-resource languages like English most tasks. We find that GPT-4 has an average or impressive performance on classification tasks but very poor results on generative tasks like machine translation. Surprisingly, we find that mT0 had the best overall on cross-lingual QA, better than the state-of-the-art supervised model (i.e. fine-tuned mT5) and GPT-4 on African languages. Overall, LLaMa 2 records the worst performance due to its limited multilingual capabilities and English-centric pre-training corpus. In general, our findings present a call-to-action to ensure African languages are well represented in large language models, given their growing popularity.
    Probing clustering in neural network representations. (arXiv:2311.07864v1 [cs.LG])
    Neural network representations contain structure beyond what was present in the training labels. For instance, representations of images that are visually or semantically similar tend to lie closer to each other than to dissimilar images, regardless of their labels. Clustering these representations can thus provide insights into dataset properties as well as the network internals. In this work, we study how the many design choices involved in neural network training affect the clusters formed in the hidden representations. To do so, we establish an evaluation setup based on the BREEDS hierarchy, for the task of subclass clustering after training models with only superclass information. We isolate the training dataset and architecture as important factors affecting clusterability. Datasets with labeled classes consisting of unrelated subclasses yield much better clusterability than those following a natural hierarchy. When using pretrained models to cluster representations on downstream datasets, models pretrained on subclass labels provide better clusterability than models pretrained on superclass labels, but only when there is a high degree of domain overlap between the pretraining and downstream data. Architecturally, we find that normalization strategies affect which layers yield the best clustering performance, and, surprisingly, Vision Transformers attain lower subclass clusterability than ResNets.
    Cattle Identification Using Muzzle Images and Deep Learning Techniques. (arXiv:2311.08148v1 [cs.CV])
    Traditional animal identification methods such as ear-tagging, ear notching, and branding have been effective but pose risks to the animal and have scalability issues. Electrical methods offer better tracking and monitoring but require specialized equipment and are susceptible to attacks. Biometric identification using time-immutable dermatoglyphic features such as muzzle prints and iris patterns is a promising solution. This project explores cattle identification using 4923 muzzle images collected from 268 beef cattle. Two deep learning classification models are implemented - wide ResNet50 and VGG16\_BN and image compression is done to lower the image quality and adapt the models to work for the African context. From the experiments run, a maximum accuracy of 99.5\% is achieved while using the wide ResNet50 model with a compression retaining 25\% of the original image. From the study, it is noted that the time required by the models to train and converge as well as recognition time are dependent on the machine used to run the model.
    Bring Your Own KG: Self-Supervised Program Synthesis for Zero-Shot KGQA. (arXiv:2311.07850v1 [cs.CL])
    We present BYOKG, a universal question-answering (QA) system that can operate on any knowledge graph (KG), requires no human-annotated training data, and can be ready to use within a day -- attributes that are out-of-scope for current KGQA systems. BYOKG draws inspiration from the remarkable ability of humans to comprehend information present in an unseen KG through exploration -- starting at random nodes, inspecting the labels of adjacent nodes and edges, and combining them with their prior world knowledge. In BYOKG, exploration leverages an LLM-backed symbolic agent that generates a diverse set of query-program exemplars, which are then used to ground a retrieval-augmented reasoning procedure to predict programs for arbitrary questions. BYOKG is effective over both small- and large-scale graphs, showing dramatic gains in QA accuracy over a zero-shot baseline of 27.89 and 58.02 F1 on GrailQA and MetaQA, respectively. On GrailQA, we further show that our unsupervised BYOKG outperforms a supervised in-context learning method, demonstrating the effectiveness of exploration. Lastly, we find that performance of BYOKG reliably improves with continued exploration as well as improvements in the base LLM, notably outperforming a state-of-the-art fine-tuned model by 7.08 F1 on a sub-sampled zero-shot split of GrailQA.
    Data-driven building energy efficiency prediction based on envelope heat losses using physics-informed neural networks. (arXiv:2311.08035v1 [cs.LG])
    The analytical prediction of building energy performance in residential buildings based on the heat losses of its individual envelope components is a challenging task. It is worth noting that this field is still in its infancy, with relatively limited research conducted in this specific area to date, especially when it comes for data-driven approaches. In this paper we introduce a novel physics-informed neural network model for addressing this problem. Through the employment of unexposed datasets that encompass general building information, audited characteristics, and heating energy consumption, we feed the deep learning model with general building information, while the model's output consists of the structural components and several thermal properties that are in fact the basic elements of an energy performance certificate (EPC). On top of this neural network, a function, based on physics equations, calculates the energy consumption of the building based on heat losses and enhances the loss function of the deep learning model. This methodology is tested on a real case study for 256 buildings located in Riga, Latvia. Our investigation comes up with promising results in terms of prediction accuracy, paving the way for automated, and data-driven energy efficiency performance prediction based on basic properties of the building, contrary to exhaustive energy efficiency audits led by humans, which are the current status quo.
    MuST: Multimodal Spatiotemporal Graph-Transformer for Hospital Readmission Prediction. (arXiv:2311.07608v1 [cs.LG])
    Hospital readmission prediction is considered an essential approach to decreasing readmission rates, which is a key factor in assessing the quality and efficacy of a healthcare system. Previous studies have extensively utilized three primary modalities, namely electronic health records (EHR), medical images, and clinical notes, to predict hospital readmissions. However, the majority of these studies did not integrate information from all three modalities or utilize the spatiotemporal relationships present in the dataset. This study introduces a novel model called the Multimodal Spatiotemporal Graph-Transformer (MuST) for predicting hospital readmissions. By employing Graph Convolution Networks and temporal transformers, we can effectively capture spatial and temporal dependencies in EHR and chest radiographs. We then propose a fusion transformer to combine the spatiotemporal features from the two modalities mentioned above with the features from clinical notes extracted by a pre-trained, domain-specific transformer. We assess the effectiveness of our methods using the latest publicly available dataset, MIMIC-IV. The experimental results indicate that the inclusion of multimodal features in MuST improves its performance in comparison to unimodal methods. Furthermore, our proposed pipeline outperforms the current leading methods in the prediction of hospital readmissions.
    FedOpenHAR: Federated Multi-Task Transfer Learning for Sensor-Based Human Activity Recognition. (arXiv:2311.07765v1 [cs.LG])
    Motion sensors integrated into wearable and mobile devices provide valuable information about the device users. Machine learning and, recently, deep learning techniques have been used to characterize sensor data. Mostly, a single task, such as recognition of activities, is targeted, and the data is processed centrally at a server or in a cloud environment. However, the same sensor data can be utilized for multiple tasks and distributed machine-learning techniques can be used without the requirement of the transmission of data to a centre. This paper explores Federated Transfer Learning in a Multi-Task manner for both sensor-based human activity recognition and device position identification tasks. The OpenHAR framework is used to train the models, which contains ten smaller datasets. The aim is to obtain model(s) applicable for both tasks in different datasets, which may include only some label types. Multiple experiments are carried in the Flower federated learning environment using the DeepConvLSTM architecture. Results are presented for federated and centralized versions under different parameters and restrictions. By utilizing transfer learning and training a task-specific and personalized federated model, we obtained a similar accuracy with training each client individually and higher accuracy than a fully centralized approach.
    MOPRD: A multidisciplinary open peer review dataset. (arXiv:2212.04972v2 [cs.DL] UPDATED)
    Open peer review is a growing trend in academic publications. Public access to peer review data can benefit both the academic and publishing communities. It also serves as a great support to studies on review comment generation and further to the realization of automated scholarly paper review. However, most of the existing peer review datasets do not provide data that cover the whole peer review process. Apart from this, their data are not diversified enough as the data are mainly collected from the field of computer science. These two drawbacks of the currently available peer review datasets need to be addressed to unlock more opportunities for related studies. In response, we construct MOPRD, a multidisciplinary open peer review dataset. This dataset consists of paper metadata, multiple version manuscripts, review comments, meta-reviews, author's rebuttal letters, and editorial decisions. Moreover, we propose a modular guided review comment generation method based on MOPRD. Experiments show that our method delivers better performance as indicated by both automatic metrics and human evaluation. We also explore other potential applications of MOPRD, including meta-review generation, editorial decision prediction, author rebuttal generation, and scientometric analysis. MOPRD is a strong endorsement for further studies in peer review-related research and other applications.
    Act-VIT: A Representationally Robust Attention Architecture for Skeleton Based Action Recognition Using Vision Transformer. (arXiv:2311.08094v1 [cs.CV])
    Skeleton-based action recognition receives the attention of many researchers as it is robust to viewpoint and illumination changes, and its processing is much more efficient than video frames. With the emergence of deep learning models, it has become very popular to represent the skeleton data in pseudo-image form and apply Convolutional Neural Networks for action recognition. Thereafter, studies concentrated on finding effective methods for forming pseudo-images. Recently, attention networks, more specifically transformers have provided promising results in various vision problems. In this study, the effectiveness of vision transformers for skeleton-based action recognition is examined and its robustness on the pseudo-image representation scheme is investigated. To this end, a three-level architecture, Act-VIT is proposed, which forms a set of pseudo images apply a classifier on each of the representation and combine their results to find the final action class. The classifiers of Act-VIT are first realized by CNNs and then by VITs and their performances are compared. Experimental studies reveal that the vision transformer is less sensitive to the initial pseudo-image representation compared to CNN. Nevertheless, even with the vision transformer, the recognition performance can be further improved by consensus of classifiers.
    Velocity-Based Channel Charting with Spatial Distribution Map Matching. (arXiv:2311.08016v1 [eess.SP])
    Fingerprint-based localization improves the positioning performance in challenging, non-line-of-sight (NLoS) dominated indoor environments. However, fingerprinting models require an expensive life-cycle management including recording and labeling of radio signals for the initial training and regularly at environmental changes. Alternatively, channel-charting avoids this labeling effort as it implicitly associates relative coordinates to the recorded radio signals. Then, with reference real-world coordinates (positions) we can use such charts for positioning tasks. However, current channel-charting approaches lag behind fingerprinting in their positioning accuracy and still require reference samples for localization, regular data recording and labeling to keep the models up to date. Hence, we propose a novel framework that does not require reference positions. We only require information from velocity information, e.g., from pedestrian dead reckoning or odometry to model the channel charts, and topological map information, e.g., a building floor plan, to transform the channel charts into real coordinates. We evaluate our approach on two different real-world datasets using 5G and distributed single-input/multiple-output system (SIMO) radio systems. Our experiments show that even with noisy velocity estimates and coarse map information, we achieve similar position accuracies
    Higher-Order Expander Graph Propagation. (arXiv:2311.07966v1 [cs.LG])
    Graph neural networks operate on graph-structured data via exchanging messages along edges. One limitation of this message passing paradigm is the over-squashing problem. Over-squashing occurs when messages from a node's expanded receptive field are compressed into fixed-size vectors, potentially causing information loss. To address this issue, recent works have explored using expander graphs, which are highly-connected sparse graphs with low diameters, to perform message passing. However, current methods on expander graph propagation only consider pair-wise interactions, ignoring higher-order structures in complex data. To explore the benefits of capturing these higher-order correlations while still leveraging expander graphs, we introduce higher-order expander graph propagation. We propose two methods for constructing bipartite expanders and evaluate their performance on both synthetic and real-world datasets.
    Out-of-Distribution Knowledge Distillation via Confidence Amendment. (arXiv:2311.07975v1 [cs.LG])
    Out-of-distribution (OOD) detection is essential in identifying test samples that deviate from the in-distribution (ID) data upon which a standard network is trained, ensuring network robustness and reliability. This paper introduces OOD knowledge distillation, a pioneering learning framework applicable whether or not training ID data is available, given a standard network. This framework harnesses OOD-sensitive knowledge from the standard network to craft a binary classifier adept at distinguishing between ID and OOD samples. To accomplish this, we introduce Confidence Amendment (CA), an innovative methodology that transforms an OOD sample into an ID one while progressively amending prediction confidence derived from the standard network. This approach enables the simultaneous synthesis of both ID and OOD samples, each accompanied by an adjusted prediction confidence, thereby facilitating the training of a binary classifier sensitive to OOD. Theoretical analysis provides bounds on the generalization error of the binary classifier, demonstrating the pivotal role of confidence amendment in enhancing OOD sensitivity. Extensive experiments spanning various datasets and network architectures confirm the efficacy of the proposed method in detecting OOD samples.
    Towards Improving Robustness Against Common Corruptions in Object Detectors Using Adversarial Contrastive Learning. (arXiv:2311.07928v1 [cs.CV])
    Neural networks have revolutionized various domains, exhibiting remarkable accuracy in tasks like natural language processing and computer vision. However, their vulnerability to slight alterations in input samples poses challenges, particularly in safety-critical applications like autonomous driving. Current approaches, such as introducing distortions during training, fall short in addressing unforeseen corruptions. This paper proposes an innovative adversarial contrastive learning framework to enhance neural network robustness simultaneously against adversarial attacks and common corruptions. By generating instance-wise adversarial examples and optimizing contrastive loss, our method fosters representations that resist adversarial perturbations and remain robust in real-world scenarios. Subsequent contrastive learning then strengthens the similarity between clean samples and their adversarial counterparts, fostering representations resistant to both adversarial attacks and common distortions. By focusing on improving performance under adversarial and real-world conditions, our approach aims to bolster the robustness of neural networks in safety-critical applications, such as autonomous vehicles navigating unpredictable weather conditions. We anticipate that this framework will contribute to advancing the reliability of neural networks in challenging environments, facilitating their widespread adoption in mission-critical scenarios.
    To Transformers and Beyond: Large Language Models for the Genome. (arXiv:2311.07621v1 [q-bio.GN])
    In the rapidly evolving landscape of genomics, deep learning has emerged as a useful tool for tackling complex computational challenges. This review focuses on the transformative role of Large Language Models (LLMs), which are mostly based on the transformer architecture, in genomics. Building on the foundation of traditional convolutional neural networks and recurrent neural networks, we explore both the strengths and limitations of transformers and other LLMs for genomics. Additionally, we contemplate the future of genomic modeling beyond the transformer architecture based on current trends in research. The paper aims to serve as a guide for computational biologists and computer scientists interested in LLMs for genomic data. We hope the paper can also serve as an educational introduction and discussion for biologists to a fundamental shift in how we will be analyzing genomic data in the future.
    Dynamic Local Attention with Hierarchical Patching for Irregular Clinical Time Series. (arXiv:2311.07744v1 [cs.LG])
    Irregular multivariate time series data is prevalent in the clinical and healthcare domains. It is characterized by time-wise and feature-wise irregularities, making it challenging for machine learning methods to work with. To solve this, we introduce a new model architecture composed of two modules: (1) DLA, a Dynamic Local Attention mechanism that uses learnable queries and feature-specific local windows when computing the self-attention operation. This results in aggregating irregular time steps raw input within each window to a harmonized regular latent space representation while taking into account the different features' sampling rates. (2) A hierarchical MLP mixer that processes the output of DLA through multi-scale patching to leverage information at various scales for the downstream tasks. Our approach outperforms state-of-the-art methods on three real-world datasets, including the latest clinical MIMIC IV dataset.
    HMOE: Hypernetwork-based Mixture of Experts for Domain Generalization. (arXiv:2211.08253v3 [cs.LG] UPDATED)
    Due to domain shifts, machine learning systems typically struggle to generalize well to new domains that differ from those of training data, which is what domain generalization (DG) aims to address. Although a variety of DG methods have been proposed, most of them fall short in interpretability and require domain labels, which are not available in many real-world scenarios. This paper presents a novel DG method, called HMOE: Hypernetwork-based Mixture of Experts (MoE), which does not rely on domain labels and is more interpretable. MoE proves effective in identifying heterogeneous patterns in data. For the DG problem, heterogeneity arises exactly from domain shifts. HMOE employs hypernetworks taking vectors as input to generate the weights of experts, which promotes knowledge sharing among experts and enables the exploration of their similarities in a low-dimensional vector space. We benchmark HMOE against other DG methods under a fair evaluation framework -- DomainBed. Our extensive experiments show that HMOE can effectively separate mixed-domain data into distinct clusters that are surprisingly more consistent with human intuition than original domain labels. Using self-learned domain information, HMOE achieves state-of-the-art results on most datasets and significantly surpasses other DG methods in average accuracy across all datasets.
    Mixture of Coupled HMMs for Robust Modeling of Multivariate Healthcare Time Series. (arXiv:2311.07867v1 [cs.LG])
    Analysis of multivariate healthcare time series data is inherently challenging: irregular sampling, noisy and missing values, and heterogeneous patient groups with different dynamics violating exchangeability. In addition, interpretability and quantification of uncertainty are critically important. Here, we propose a novel class of models, a mixture of coupled hidden Markov models (M-CHMM), and demonstrate how it elegantly overcomes these challenges. To make the model learning feasible, we derive two algorithms to sample the sequences of the latent variables in the CHMM: samplers based on (i) particle filtering and (ii) factorized approximation. Compared to existing inference methods, our algorithms are computationally tractable, improve mixing, and allow for likelihood estimation, which is necessary to learn the mixture model. Experiments on challenging real-world epidemiological and semi-synthetic data demonstrate the advantages of the M-CHMM: improved data fit, capacity to efficiently handle missing and noisy measurements, improved prediction accuracy, and ability to identify interpretable subsets in the data.
    SynthEnsemble: A Fusion of CNN, Vision Transformer, and Hybrid Models for Multi-Label Chest X-Ray Classification. (arXiv:2311.07750v1 [cs.CV])
    Chest X-rays are widely used to diagnose thoracic diseases, but the lack of detailed information about these abnormalities makes it challenging to develop accurate automated diagnosis systems, which is crucial for early detection and effective treatment. To address this challenge, we employed deep learning techniques to identify patterns in chest X-rays that correspond to different diseases. We conducted experiments on the "ChestX-ray14" dataset using various pre-trained CNNs, transformers, hybrid(CNN+Transformer) models and classical models. The best individual model was the CoAtNet, which achieved an area under the receiver operating characteristic curve (AUROC) of 84.2%. By combining the predictions of all trained models using a weighted average ensemble where the weight of each model was determined using differential evolution, we further improved the AUROC to 85.4%, outperforming other state-of-the-art methods in this field. Our findings demonstrate the potential of deep learning techniques, particularly ensemble deep learning, for improving the accuracy of automatic diagnosis of thoracic diseases from chest X-rays.
    MD-IQA: Learning Multi-scale Distributed Image Quality Assessment with Semi Supervised Learning for Low Dose CT. (arXiv:2311.08024v1 [eess.IV])
    Image quality assessment (IQA) plays a critical role in optimizing radiation dose and developing novel medical imaging techniques in computed tomography (CT). Traditional IQA methods relying on hand-crafted features have limitations in summarizing the subjective perceptual experience of image quality. Recent deep learning-based approaches have demonstrated strong modeling capabilities and potential for medical IQA, but challenges remain regarding model generalization and perceptual accuracy. In this work, we propose a multi-scale distributions regression approach to predict quality scores by constraining the output distribution, thereby improving model generalization. Furthermore, we design a dual-branch alignment network to enhance feature extraction capabilities. Additionally, semi-supervised learning is introduced by utilizing pseudo-labels for unlabeled data to guide model training. Extensive qualitative experiments demonstrate the effectiveness of our proposed method for advancing the state-of-the-art in deep learning-based medical IQA. Code is available at: https://github.com/zunzhumu/MD-IQA.
    Discretized Distributed Optimization over Dynamic Digraphs. (arXiv:2311.07939v1 [math.OC])
    We consider a discrete-time model of continuous-time distributed optimization over dynamic directed-graphs (digraphs) with applications to distributed learning. Our optimization algorithm works over general strongly connected dynamic networks under switching topologies, e.g., in mobile multi-agent systems and volatile networks due to link failures. Compared to many existing lines of work, there is no need for bi-stochastic weight designs on the links. The existing literature mostly needs the link weights to be stochastic using specific weight-design algorithms needed both at the initialization and at all times when the topology of the network changes. This paper eliminates the need for such algorithms and paves the way for distributed optimization over time-varying digraphs. We derive the bound on the gradient-tracking step-size and discrete time-step for convergence and prove dynamic stability using arguments from consensus algorithms, matrix perturbation theory, and Lyapunov theory. This work, particularly, is an improvement over existing stochastic-weight undirected networks in case of link removal or packet drops. This is because the existing literature may need to rerun time-consuming and computationally complex algorithms for stochastic design, while the proposed strategy works as long as the underlying network is weight-symmetric and balanced. The proposed optimization framework finds applications to distributed classification and learning.
    Multi-Signal Reconstruction Using Masked Autoencoder From EEG During Polysomnography. (arXiv:2311.07868v1 [cs.LG])
    Polysomnography (PSG) is an indispensable diagnostic tool in sleep medicine, essential for identifying various sleep disorders. By capturing physiological signals, including EEG, EOG, EMG, and cardiorespiratory metrics, PSG presents a patient's sleep architecture. However, its dependency on complex equipment and expertise confines its use to specialized clinical settings. Addressing these limitations, our study aims to perform PSG by developing a system that requires only a single EEG measurement. We propose a novel system capable of reconstructing multi-signal PSG from a single-channel EEG based on a masked autoencoder. The masked autoencoder was trained and evaluated using the Sleep-EDF-20 dataset, with mean squared error as the metric for assessing the similarity between original and reconstructed signals. The model demonstrated proficiency in reconstructing multi-signal data. Our results present promise for the development of more accessible and long-term sleep monitoring systems. This suggests the expansion of PSG's applicability, enabling its use beyond the confines of clinics.
    On The Truthfulness of 'Surprisingly Likely' Responses of Large Language Models. (arXiv:2311.07692v1 [cs.LG])
    The surprisingly likely criterion in the seminal work of Prelec (the Bayesian Truth Serum) guarantees truthfulness in a game-theoretic multi-agent setting, by rewarding rational agents to maximise the expected information gain with their answers w.r.t. their probabilistic beliefs. We investigate the relevance of a similar criterion for responses of LLMs. We hypothesize that if the surprisingly likely criterion works in LLMs, under certain conditions, the responses that maximize the reward under this criterion should be more accurate than the responses that only maximize the posterior probability. Using benchmarks including the TruthfulQA benchmark and using openly available LLMs: GPT-2 and LLaMA-2, we show that the method indeed improves the accuracy significantly (for example, upto 24 percentage points aggregate improvement on TruthfulQA and upto 70 percentage points improvement on individual categories of questions).
    CSLP-AE: A Contrastive Split-Latent Permutation Autoencoder Framework for Zero-Shot Electroencephalography Signal Conversion. (arXiv:2311.07788v1 [cs.LG])
    Electroencephalography (EEG) is a prominent non-invasive neuroimaging technique providing insights into brain function. Unfortunately, EEG data exhibit a high degree of noise and variability across subjects hampering generalizable signal extraction. Therefore, a key aim in EEG analysis is to extract the underlying neural activation (content) as well as to account for the individual subject variability (style). We hypothesize that the ability to convert EEG signals between tasks and subjects requires the extraction of latent representations accounting for content and style. Inspired by recent advancements in voice conversion technologies, we propose a novel contrastive split-latent permutation autoencoder (CSLP-AE) framework that directly optimizes for EEG conversion. Importantly, the latent representations are guided using contrastive learning to promote the latent splits to explicitly represent subject (style) and task (content). We contrast CSLP-AE to conventional supervised, unsupervised (AE), and self-supervised (contrastive learning) training and find that the proposed approach provides favorable generalizable characterizations of subject and task. Importantly, the procedure also enables zero-shot conversion between unseen subjects. While the present work only considers conversion of EEG, the proposed CSLP-AE provides a general framework for signal conversion and extraction of content (task activation) and style (subject variability) components of general interest for the modeling and analysis of biological signals.
    Probabilistic Physics-integrated Neural Differentiable Modeling for Isothermal Chemical Vapor Infiltration Process. (arXiv:2311.07798v1 [cs.LG])
    Chemical vapor infiltration (CVI) is a widely adopted manufacturing technique used in producing carbon-carbon and carbon-silicon carbide composites. These materials are especially valued in the aerospace and automotive industries for their robust strength and lightweight characteristics. The densification process during CVI critically influences the final performance, quality, and consistency of these composite materials. Experimentally optimizing the CVI processes is challenging due to long experimental time and large optimization space. To address these challenges, this work takes a modeling-centric approach. Due to the complexities and limited experimental data of the isothermal CVI densification process, we have developed a data-driven predictive model using the physics-integrated neural differentiable (PiNDiff) modeling framework. An uncertainty quantification feature has been embedded within the PiNDiff method, bolstering the model's reliability and robustness. Through comprehensive numerical experiments involving both synthetic and real-world manufacturing data, the proposed method showcases its capability in modeling densification during the CVI process. This research highlights the potential of the PiNDiff framework as an instrumental tool for advancing our understanding, simulation, and optimization of the CVI manufacturing process, particularly when faced with sparse data and an incomplete description of the underlying physics.
    Evaluating Neighbor Explainability for Graph Neural Networks. (arXiv:2311.08118v1 [cs.LG])
    Explainability in Graph Neural Networks (GNNs) is a new field growing in the last few years. In this publication we address the problem of determining how important is each neighbor for the GNN when classifying a node and how to measure the performance for this specific task. To do this, various known explainability methods are reformulated to get the neighbor importance and four new metrics are presented. Our results show that there is almost no difference between the explanations provided by gradient-based techniques in the GNN domain. In addition, many explainability techniques failed to identify important neighbors when GNNs without self-loops are used.
    The Hyperdimensional Transform for Distributional Modelling, Regression and Classification. (arXiv:2311.08150v1 [cs.LG])
    Hyperdimensional computing (HDC) is an increasingly popular computing paradigm with immense potential for future intelligent applications. Although the main ideas already took form in the 1990s, HDC recently gained significant attention, especially in the field of machine learning and data science. Next to efficiency, interoperability and explainability, HDC offers attractive properties for generalization as it can be seen as an attempt to combine connectionist ideas from neural networks with symbolic aspects. In recent work, we introduced the hyperdimensional transform, revealing deep theoretical foundations for representing functions and distributions as high-dimensional holographic vectors. Here, we present the power of the hyperdimensional transform to a broad data science audience. We use the hyperdimensional transform as a theoretical basis and provide insight into state-of-the-art HDC approaches for machine learning. We show how existing algorithms can be modified and how this transform can lead to a novel, well-founded toolbox. Next to the standard regression and classification tasks of machine learning, our discussion includes various aspects of statistical modelling, such as representation, learning and deconvolving distributions, sampling, Bayesian inference, and uncertainty estimation.
    Generalization Analogies (GENIES): A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains. (arXiv:2311.07723v1 [cs.AI])
    As AI systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions; however, this risk can be mitigated by controlling how LLMs generalize human feedback to situations where it is unreliable. To better understand how reward models generalize, we craft 69 distribution shifts spanning 8 categories. We find that reward models do not learn to evaluate `instruction-following' by default and instead favor personas that resemble internet text. Techniques for interpreting reward models' internal representations achieve better generalization than standard fine-tuning, but still frequently fail to distinguish instruction-following from conflated behaviors. We consolidate the 15 most challenging distribution shifts into the GENaralization analogIES (GENIES) benchmark, which we hope will enable progress toward controlling reward model generalization.
    A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity. (arXiv:2302.06015v3 [cs.LG] UPDATED)
    Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Due to non-convex interactions across layers, however, theoretical learning and generalization analysis is mostly elusive. Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity to achieve a zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. We also prove that a training process using stochastic gradient descent (SGD) leads to a sparse attention map, which is a formal verification of the general intuition about the success of attention. Moreover, this paper indicates that a proper token sparsification can improve the test performance by removing label-irrelevant and/or noisy tokens, including spurious correlations. Empirical experiments on synthetic data and CIFAR-10 dataset justify our theoretical results and generalize to deeper ViTs.
    Attention-based Multi-task Learning for Base Editor Outcome Prediction. (arXiv:2311.07636v1 [q-bio.GN])
    Human genetic diseases often arise from point mutations, emphasizing the critical need for precise genome editing techniques. Among these, base editing stands out as it allows targeted alterations at the single nucleotide level. However, its clinical application is hindered by low editing efficiency and unintended mutations, necessitating extensive trial-and-error experimentation in the laboratory. To speed up this process, we present an attention-based two-stage machine learning model that learns to predict the likelihood of all possible editing outcomes for a given genomic target sequence. We further propose a multi-task learning schema to jointly learn multiple base editors (i.e. variants) at once. Our model's predictions consistently demonstrated a strong correlation with the actual experimental results on multiple datasets and base editor variants. These results provide further validation for the models' capacity to enhance and accelerate the process of refining base editing designs.
    Improved Kernel Alignment Regret Bound for Online Kernel Learning. (arXiv:2212.12989v3 [cs.LG] UPDATED)
    In this paper, we improve the kernel alignment regret bound for online kernel learning in the regime of the Hinge loss function. Previous algorithm achieves a regret of $O((\mathcal{A}_TT\ln{T})^{\frac{1}{4}})$ at a computational complexity (space and per-round time) of $O(\sqrt{\mathcal{A}_TT\ln{T}})$, where $\mathcal{A}_T$ is called \textit{kernel alignment}. We propose an algorithm whose regret bound and computational complexity are better than previous results. Our results depend on the decay rate of eigenvalues of the kernel matrix. If the eigenvalues of the kernel matrix decay exponentially, then our algorithm enjoys a regret of $O(\sqrt{\mathcal{A}_T})$ at a computational complexity of $O(\ln^2{T})$. Otherwise, our algorithm enjoys a regret of $O((\mathcal{A}_TT)^{\frac{1}{4}})$ at a computational complexity of $O(\sqrt{\mathcal{A}_TT})$. We extend our algorithm to batch learning and obtain a $O(\frac{1}{T}\sqrt{\mathbb{E}[\mathcal{A}_T]})$ excess risk bound which improves the previous $O(1/\sqrt{T})$ bound.
    Investigating Multi-Pivot Ensembling with Massively Multilingual Machine Translation Models. (arXiv:2311.07439v2 [cs.CL] UPDATED)
    Massively multilingual machine translation models allow for the translation of a large number of languages with a single model, but have limited performance on low- and very-low-resource translation directions. Pivoting via high-resource languages remains a strong strategy for low-resource directions, and in this paper we revisit ways of pivoting through multiple languages. Previous work has used a simple averaging of probability distributions from multiple paths, but we find that this performs worse than using a single pivot, and exacerbates the hallucination problem because the same hallucinations can be probable across different paths. As an alternative, we propose MaxEns, a combination strategy that is biased towards the most confident predictions, hypothesising that confident predictions are less prone to be hallucinations. We evaluate different strategies on the FLORES benchmark for 20 low-resource language directions, demonstrating that MaxEns improves translation quality for low-resource languages while reducing hallucination in translations, compared to both direct translation and an averaging approach. On average, multi-pivot strategies still lag behind using English as a single pivot language, raising the question of how to identify the best pivoting strategy for a given translation direction.
    Generative Intrinsic Optimization: Intrinsic Control with Model Learning. (arXiv:2310.08100v2 [cs.LG] UPDATED)
    Future sequence represents the outcome after executing the action into the environment (i.e. the trajectory onwards). When driven by the information-theoretic concept of mutual information, it seeks maximally informative consequences. Explicit outcomes may vary across state, return, or trajectory serving different purposes such as credit assignment or imitation learning. However, the inherent nature of incorporating intrinsic motivation with reward maximization is often neglected. In this work, we propose a policy iteration scheme that seamlessly incorporates the mutual information, ensuring convergence to the optimal policy. Concurrently, a variational approach is introduced, which jointly learns the necessary quantity for estimating the mutual information and the dynamics model, providing a general framework for incorporating different forms of outcomes of interest. While we mainly focus on theoretical analysis, our approach opens the possibilities of leveraging intrinsic control with model learning to enhance sample efficiency and incorporate uncertainty of the environment into decision-making.
    Denoising diffusion-based MRI to CT image translation enables automated spinal segmentation. (arXiv:2308.09345v2 [eess.IV] UPDATED)
    Background: Automated segmentation of spinal MR images plays a vital role both scientifically and clinically. However, accurately delineating posterior spine structures presents challenges. Methods: This retrospective study, approved by the ethical committee, involved translating T1w and T2w MR image series into CT images in a total of n=263 pairs of CT/MR series. Landmark-based registration was performed to align image pairs. We compared 2D paired (Pix2Pix, denoising diffusion implicit models (DDIM) image mode, DDIM noise mode) and unpaired (contrastive unpaired translation, SynDiff) image-to-image translation using "peak signal to noise ratio" (PSNR) as quality measure. A publicly available segmentation network segmented the synthesized CT datasets, and Dice scores were evaluated on in-house test sets and the "MRSpineSeg Challenge" volumes. The 2D findings were extended to 3D Pix2Pix and DDIM. Results: 2D paired methods and SynDiff exhibited similar translation performance and Dice scores on paired data. DDIM image mode achieved the highest image quality. SynDiff, Pix2Pix, and DDIM image mode demonstrated similar Dice scores (0.77). For craniocaudal axis rotations, at least two landmarks per vertebra were required for registration. The 3D translation outperformed the 2D approach, resulting in improved Dice scores (0.80) and anatomically accurate segmentations in a higher resolution than the original MR image. Conclusion: Two landmarks per vertebra registration enabled paired image-to-image translation from MR to CT and outperformed all unpaired approaches. The 3D techniques provided anatomically correct segmentations, avoiding underprediction of small structures like the spinous process.
    Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes. (arXiv:2212.05949v3 [stat.ML] UPDATED)
    Despite the significant interest and progress in reinforcement learning (RL) problems with adversarial corruption, current works are either confined to the linear setting or lead to an undesired $\tilde{O}(\sqrt{T}\zeta)$ regret bound, where $T$ is the number of rounds and $\zeta$ is the total amount of corruption. In this paper, we consider the contextual bandit with general function approximation and propose a computationally efficient algorithm to achieve a regret of $\tilde{O}(\sqrt{T}+\zeta)$. The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandit and a new weighted estimator of uncertainty for the general function class. In contrast to the existing analysis that heavily relies on the linear structure, we develop a novel technique to control the sum of weighted uncertainty, thus establishing the final regret bounds. We then generalize our algorithm to the episodic MDP setting and first achieve an additive dependence on the corruption level $\zeta$ in the scenario of general function approximation. Notably, our algorithms achieve regret bounds either nearly match the performance lower bound or improve the existing methods for all the corruption levels and in both known and unknown $\zeta$ cases.
    SAMIHS: Adaptation of Segment Anything Model for Intracranial Hemorrhage Segmentation. (arXiv:2311.08190v1 [eess.IV])
    Segment Anything Model (SAM), a vision foundation model trained on large-scale annotations, has recently continued raising awareness within medical image segmentation. Despite the impressive capabilities of SAM on natural scenes, it struggles with performance decline when confronted with medical images, especially those involving blurry boundaries and highly irregular regions of low contrast. In this paper, a SAM-based parameter-efficient fine-tuning method, called SAMIHS, is proposed for intracranial hemorrhage segmentation, which is a crucial and challenging step in stroke diagnosis and surgical planning. Distinguished from previous SAM and SAM-based methods, SAMIHS incorporates parameter-refactoring adapters into SAM's image encoder and considers the efficient and flexible utilization of adapters' parameters. Additionally, we employ a combo loss that combines binary cross-entropy loss and boundary-sensitive loss to enhance SAMIHS's ability to recognize the boundary regions. Our experimental results on two public datasets demonstrate the effectiveness of our proposed method. Code is available at https://github.com/mileswyn/SAMIHS .
    Counterfactual Explanation for Regression via Disentanglement in Latent Space. (arXiv:2311.08228v1 [cs.LG])
    Counterfactual Explanations (CEs) help address the question: How can the factors that influence the prediction of a predictive model be changed to achieve a more favorable outcome from a user's perspective? Thus, they bear the potential to guide the user's interaction with AI systems since they represent easy-to-understand explanations. To be applicable, CEs need to be realistic and actionable. In the literature, various methods have been proposed to generate CEs. However, the majority of research on CEs focuses on classification problems where questions like ``What should I do to get my rejected loan approved?" are raised. In practice, answering questions like ``What should I do to increase my salary?" are of a more regressive nature. In this paper, we introduce a novel method to generate CEs for a pre-trained regressor by first disentangling the label-relevant from the label-irrelevant dimensions in the latent space. CEs are then generated by combining the label-irrelevant dimensions and the predefined output. The intuition behind this approach is that the ideal counterfactual search should focus on the label-irrelevant characteristics of the input and suggest changes toward target-relevant characteristics. Searching in the latent space could help achieve this goal. We show that our method maintains the characteristics of the query sample during the counterfactual search. In various experiments, we demonstrate that the proposed method is competitive based on different quality measures on image and tabular datasets in regression problem settings. It efficiently returns results closer to the original data manifold compared to three state-of-the-art methods, which is essential for realistic high-dimensional machine learning applications. Our code will be made available as an open-source package upon the publication of this work.
    Instruction-Following Evaluation for Large Language Models. (arXiv:2311.07911v1 [cs.CL])
    One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
    DynaConF: Dynamic Forecasting of Non-Stationary Time-Series. (arXiv:2209.08411v2 [cs.LG] UPDATED)
    Deep learning has shown impressive results in a variety of time series forecasting tasks, where modeling the conditional distribution of the future given the past is the essence. However, when this conditional distribution is non-stationary, it poses challenges for these models to learn consistently and to predict accurately. In this work, we propose a new method to model non-stationary conditional distributions over time by clearly decoupling stationary conditional distribution modeling from non-stationary dynamics modeling. Our method is based on a Bayesian dynamic model that can adapt to conditional distribution changes and a deep conditional distribution model that handles multivariate time series using a factorized output space. Our experimental results on synthetic and real-world datasets show that our model can adapt to non-stationary time series better than state-of-the-art deep learning solutions.
    GMTR: Graph Matching Transformers. (arXiv:2311.08141v1 [cs.CV])
    Vision transformers (ViTs) have recently been used for visual matching beyond object detection and segmentation. However, the original grid dividing strategy of ViTs neglects the spatial information of the keypoints, limiting the sensitivity to local information. Therefore, we propose \textbf{QueryTrans} (Query Transformer), which adopts a cross-attention module and keypoints-based center crop strategy for better spatial information extraction. We further integrate the graph attention module and devise a transformer-based graph matching approach \textbf{GMTR} (Graph Matching TRansformers) whereby the combinatorial nature of GM is addressed by a graph transformer neural GM solver. On standard GM benchmarks, GMTR shows competitive performance against the SOTA frameworks. Specifically, on Pascal VOC, GMTR achieves $\mathbf{83.6\%}$ accuracy, $\mathbf{0.9\%}$ higher than the SOTA framework. On Spair-71k, GMTR shows great potential and outperforms most of the previous works. Meanwhile, on Pascal VOC, QueryTrans improves the accuracy of NGMv2 from $80.1\%$ to $\mathbf{83.3\%}$, and BBGM from $79.0\%$ to $\mathbf{84.5\%}$. On Spair-71k, QueryTrans improves NGMv2 from $80.6\%$ to $\mathbf{82.5\%}$, and BBGM from $82.1\%$ to $\mathbf{83.9\%}$. Source code will be made publicly available.
    Unbiased Learning of Deep Generative Models with Structured Discrete Representations. (arXiv:2306.08230v2 [cs.LG] UPDATED)
    By composing graphical models with deep learning architectures, we learn generative models with the strengths of both frameworks. The structured variational autoencoder (SVAE) inherits structure and interpretability from graphical models, and flexible likelihoods for high-dimensional data from deep learning, but poses substantial optimization challenges. We propose novel algorithms for learning SVAEs, and are the first to demonstrate the SVAE's ability to handle multimodal uncertainty when data is missing by incorporating discrete latent variables. Our memory-efficient implicit differentiation scheme makes the SVAE tractable to learn via gradient descent, while demonstrating robustness to incomplete optimization. To more rapidly learn accurate graphical model parameters, we derive a method for computing natural gradients without manual derivations, which avoids biases found in prior work. These optimization innovations enable the first comparisons of the SVAE to state-of-the-art time series models, where the SVAE performs competitively while learning interpretable and structured discrete data representations.
    MechAgents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge. (arXiv:2311.08166v1 [cs.AI])
    Solving mechanics problems using numerical methods requires comprehensive intelligent capability of retrieving relevant knowledge and theory, constructing and executing codes, analyzing the results, a task that has thus far mainly been reserved for humans. While emerging AI methods can provide effective approaches to solve end-to-end problems, for instance via the use of deep surrogate models or various data analytics strategies, they often lack physical intuition since knowledge is baked into the parametric complement through training, offering less flexibility when it comes to incorporating mathematical or physical insights. By leveraging diverse capabilities of multiple dynamically interacting large language models (LLMs), we can overcome the limitations of conventional approaches and develop a new class of physics-inspired generative machine learning platform, here referred to as MechAgents. A set of AI agents can solve mechanics tasks, here demonstrated for elasticity problems, via autonomous collaborations. A two-agent team can effectively write, execute and self-correct code, in order to apply finite element methods to solve classical elasticity problems in various flavors (different boundary conditions, domain geometries, meshes, small/finite deformation and linear/hyper-elastic constitutive laws, and others). For more complex tasks, we construct a larger group of agents with enhanced division of labor among planning, formulating, coding, executing and criticizing the process and results. The agents mutually correct each other to improve the overall team-work performance in understanding, formulating and validating the solution. Our framework shows the potential of synergizing the intelligence of language models, the reliability of physics-based modeling, and the dynamic collaborations among diverse agents, opening novel avenues for automation of solving engineering problems.
    Low-rank variational Bayes correction to the Laplace method. (arXiv:2111.12945v2 [stat.ME] UPDATED)
    Approximate inference methods like the Laplace method, Laplace approximations and variational methods, amongst others, are popular methods when exact inference is not feasible due to the complexity of the model or the abundance of data. In this paper we propose a hybrid approximate method called Low-Rank Variational Bayes correction (VBC), that uses the Laplace method and subsequently a Variational Bayes correction in a lower dimension, to the joint posterior mean. The cost is essentially that of the Laplace method which ensures scalability of the method, in both model complexity and data size. Models with fixed and unknown hyperparameters are considered, for simulated and real examples, for small and large datasets.
    Evolutionary-enhanced quantum supervised learning model. (arXiv:2311.08081v1 [quant-ph])
    Quantum supervised learning, utilizing variational circuits, stands out as a promising technology for NISQ devices due to its efficiency in hardware resource utilization during the creation of quantum feature maps and the implementation of hardware-efficient ansatz with trainable parameters. Despite these advantages, the training of quantum models encounters challenges, notably the barren plateau phenomenon, leading to stagnation in learning during optimization iterations. This study proposes an innovative approach: an evolutionary-enhanced ansatz-free supervised learning model. In contrast to parametrized circuits, our model employs circuits with variable topology that evolves through an elitist method, mitigating the barren plateau issue. Additionally, we introduce a novel concept, the superposition of multi-hot encodings, facilitating the treatment of multi-classification problems. Our framework successfully avoids barren plateaus, resulting in enhanced model accuracy. Comparative analysis with variational quantum classifiers from the technology's state-of-the-art reveal a substantial improvement in training efficiency and precision. Furthermore, we conduct tests on a challenging dataset class, traditionally problematic for conventional kernel machines, demonstrating a potential alternative path for achieving quantum advantage in supervised learning for NISQ era.
    Leveraging Hamilton-Jacobi PDEs with time-dependent Hamiltonians for continual scientific machine learning. (arXiv:2311.07790v1 [cs.LG])
    We address two major challenges in scientific machine learning (SciML): interpretability and computational efficiency. We increase the interpretability of certain learning processes by establishing a new theoretical connection between optimization problems arising from SciML and a generalized Hopf formula, which represents the viscosity solution to a Hamilton-Jacobi partial differential equation (HJ PDE) with time-dependent Hamiltonian. Namely, we show that when we solve certain regularized learning problems with integral-type losses, we actually solve an optimal control problem and its associated HJ PDE with time-dependent Hamiltonian. This connection allows us to reinterpret incremental updates to learned models as the evolution of an associated HJ PDE and optimal control problem in time, where all of the previous information is intrinsically encoded in the solution to the HJ PDE. As a result, existing HJ PDE solvers and optimal control algorithms can be reused to design new efficient training approaches for SciML that naturally coincide with the continual learning framework, while avoiding catastrophic forgetting. As a first exploration of this connection, we consider the special case of linear regression and leverage our connection to develop a new Riccati-based methodology for solving these learning problems that is amenable to continual learning applications. We also provide some corresponding numerical examples that demonstrate the potential computational and memory advantages our Riccati-based approach can provide.
    Learning Adversarial Low-rank Markov Decision Processes with Unknown Transition and Full-information Feedback. (arXiv:2311.07876v1 [cs.LG])
    In this work, we study the low-rank MDPs with adversarially changed losses in the full-information feedback setting. In particular, the unknown transition probability kernel admits a low-rank matrix decomposition \citep{REPUCB22}, and the loss functions may change adversarially but are revealed to the learner at the end of each episode. We propose a policy optimization-based algorithm POLO, and we prove that it attains the $\widetilde{O}(K^{\frac{5}{6}}A^{\frac{1}{2}}d\ln(1+M)/(1-\gamma)^2)$ regret guarantee, where $d$ is rank of the transition kernel (and hence the dimension of the unknown representations), $A$ is the cardinality of the action space, $M$ is the cardinality of the model class, and $\gamma$ is the discounted factor. Notably, our algorithm is oracle-efficient and has a regret guarantee with no dependence on the size of potentially arbitrarily large state space. Furthermore, we also prove an $\Omega(\frac{\gamma^2}{1-\gamma} \sqrt{d A K})$ regret lower bound for this problem, showing that low-rank MDPs are statistically more difficult to learn than linear MDPs in the regret minimization setting. To the best of our knowledge, we present the first algorithm that interleaves representation learning, exploration, and exploitation to achieve the sublinear regret guarantee for RL with nonlinear function approximation and adversarial losses.
    Toward Efficient and Incremental Spectral Clustering via Parametric Spectral Clustering. (arXiv:2311.07833v1 [cs.LG])
    Spectral clustering is a popular method for effectively clustering nonlinearly separable data. However, computational limitations, memory requirements, and the inability to perform incremental learning challenge its widespread application. To overcome these limitations, this paper introduces a novel approach called parametric spectral clustering (PSC). By extending the capabilities of spectral clustering, PSC addresses the challenges associated with big data and real-time scenarios and enables efficient incremental clustering with new data points. Experimental evaluations conducted on various open datasets demonstrate the superiority of PSC in terms of computational efficiency while achieving clustering quality mostly comparable to standard spectral clustering. The proposed approach has significant potential for incremental and real-time data analysis applications, facilitating timely and accurate clustering in dynamic and evolving datasets. The findings of this research contribute to the advancement of clustering techniques and open new avenues for efficient and effective data analysis. We publish the experimental code at https://github.com/109502518/PSC_BigData.
    Enhancing Actuarial Non-Life Pricing Models via Transformers. (arXiv:2311.07597v1 [cs.LG])
    Currently, there is a lot of research in the field of neural networks for non-life insurance pricing. The usual goal is to improve the predictive power via neural networks while building upon the generalized linear model, which is the current industry standard. Our paper contributes to this current journey via novel methods to enhance actuarial non-life models with transformer models for tabular data. We build here upon the foundation laid out by the combined actuarial neural network as well as the localGLMnet and enhance those models via the feature tokenizer transformer. The manuscript demonstrates the performance of the proposed methods on a real-world claim frequency dataset and compares them with several benchmark models such as generalized linear models, feed-forward neural networks, combined actuarial neural networks, LocalGLMnet, and pure feature tokenizer transformer. The paper shows that the new methods can achieve better results than the benchmark models while preserving certain generalized linear model advantages. The paper also discusses the practical implications and challenges of applying transformer models in actuarial settings.
    Comparison of two data fusion approaches for land use classification. (arXiv:2311.07967v1 [cs.LG])
    Accurate land use maps, describing the territory from an anthropic utilisation point of view, are useful tools for land management and planning. To produce them, the use of optical images alone remains limited. It is therefore necessary to make use of several heterogeneous sources, each carrying complementary or contradictory information due to their imperfections or their different specifications. This study compares two different approaches i.e. a pre-classification and a post-classification fusion approach for combining several sources of spatial data in the context of land use classification. The approaches are applied on authoritative land use data located in the Gers department in the southwest of France. Pre-classification fusion, while not explicitly modeling imperfections, has the best final results, reaching an overall accuracy of 97% and a macro-mean F1 score of 88%.
    Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification. (arXiv:2311.07593v1 [cs.CL])
    A promising approach for improving the performance of vision-language models like CLIP for image classification is to extend the class descriptions (i.e., prompts) with related attributes, e.g., using brown sparrow instead of sparrow. However, current zero-shot methods select a subset of attributes regardless of commonalities between the target classes, potentially providing no useful information that would have helped to distinguish between them. For instance, they may use color instead of bill shape to distinguish between sparrows and wrens, which are both brown. We propose Follow-up Differential Descriptions (FuDD), a zero-shot approach that tailors the class descriptions to each dataset and leads to additional attributes that better differentiate the target classes. FuDD first identifies the ambiguous classes for each image, and then uses a Large Language Model (LLM) to generate new class descriptions that differentiate between them. The new class descriptions resolve the initial ambiguity and help predict the correct label. In our experiments, FuDD consistently outperforms generic description ensembles and naive LLM-generated descriptions on 12 datasets. We show that differential descriptions are an effective tool to resolve class ambiguities, which otherwise significantly degrade the performance. We also show that high quality natural language class descriptions produced by FuDD result in comparable performance to few-shot adaptation methods.  ( 2 min )
    RoboSense At Edge: Detecting Slip, Crumple and Shape of the Object in Robotic Hand for Teleoprations. (arXiv:2311.07888v1 [cs.RO])
    Slip and crumple detection is essential for performing robust manipulation tasks with a robotic hand (RH) like remote surgery. It has been one of the challenging problems in the robotics manipulation community. In this work, we propose a technique based on machine learning (ML) based techniques to detect the slip, and crumple as well as the shape of an object that is currently held in the robotic hand. We proposed ML model will detect the slip, crumple, and shape using the force/torque exerted and the angular positions of the actuators present in the RH. The proposed model would be integrated into the loop of a robotic hand(RH) and haptic glove(HG). This would help us to reduce the latency in case of teleoperation  ( 2 min )
    Rethinking and Benchmarking Predict-then-Optimize Paradigm for Combinatorial Optimization Problems. (arXiv:2311.07633v1 [cs.LG])
    Numerous web applications rely on solving combinatorial optimization problems, such as energy cost-aware scheduling, budget allocation on web advertising, and graph matching on social networks. However, many optimization problems involve unknown coefficients, and improper predictions of these factors may lead to inferior decisions which may cause energy wastage, inefficient resource allocation, inappropriate matching in social networks, etc. Such a research topic is referred to as "Predict-Then-Optimize (PTO)" which considers the performance of prediction and decision-making in a unified system. A noteworthy recent development is the end-to-end methods by directly optimizing the ultimate decision quality which claims to yield better results in contrast to the traditional two-stage approach. However, the evaluation benchmarks in this field are fragmented and the effectiveness of various models in different scenarios remains unclear, hindering the comprehensive assessment and fast deployment of these methods. To address these issues, we provide a comprehensive categorization of current approaches and integrate existing experimental scenarios to establish a unified benchmark, elucidating the circumstances under which end-to-end training yields improvements, as well as the contexts in which it performs ineffectively. We also introduce a new dataset for the industrial combinatorial advertising problem for inclusive finance to open-source. We hope the rethinking and benchmarking of PTO could facilitate more convenient evaluation and deployment, and inspire further improvements both in the academy and industry within this field.
    Learning Mutually Informed Representations for Characters and Subwords. (arXiv:2311.07853v1 [cs.CL])
    Most pretrained language models rely on subword tokenization, which processes text as a sequence of subword tokens. However, different granularities of text, such as characters, subwords, and words, can contain different kinds of information. Previous studies have shown that incorporating multiple input granularities improves model generalization, yet very few of them outputs useful representations for each granularity. In this paper, we introduce the entanglement model, aiming to combine character and subword language models. Inspired by vision-language models, our model treats characters and subwords as separate modalities, and it generates mutually informed representations for both granularities as output. We evaluate our model on text classification, named entity recognition, and POS-tagging tasks. Notably, the entanglement model outperforms its backbone language models, particularly in the presence of noisy texts and low-resource languages. Furthermore, the entanglement model even outperforms larger pre-trained models on all English sequence labeling tasks and classification tasks. Our anonymized code is available at https://anonymous.4open.science/r/noisy-IE-A673  ( 2 min )
    Language Models are Better Bug Detector Through Code-Pair Classification. (arXiv:2311.07957v1 [cs.SE])
    Large language models (LLMs) such as GPT-3.5 and CodeLlama are powerful models for code generation and understanding. Fine-tuning these models comes with a high computational cost and requires a large labeled dataset. Alternatively, in-context learning techniques allow models to learn downstream tasks with only a few examples. Recently, researchers have shown how in-context learning performs well in bug detection and repair. In this paper, we propose code-pair classification task in which both the buggy and non-buggy versions are given to the model, and the model identifies the buggy ones. We evaluate our task in real-world dataset of bug detection and two most powerful LLMs. Our experiments indicate that an LLM can often pick the buggy from the non-buggy version of the code, and the code-pair classification task is much easier compared to be given a snippet and deciding if and where a bug exists.
    EPIM: Efficient Processing-In-Memory Accelerators based on Epitome. (arXiv:2311.07620v1 [cs.AR])
    The exploration of Processing-In-Memory (PIM) accelerators has garnered significant attention within the research community. However, the utilization of large-scale neural networks on Processing-In-Memory (PIM) accelerators encounters challenges due to constrained on-chip memory capacity. To tackle this issue, current works explore model compression algorithms to reduce the size of Convolutional Neural Networks (CNNs). Most of these algorithms either aim to represent neural operators with reduced-size parameters (e.g., quantization) or search for the best combinations of neural operators (e.g., neural architecture search). Designing neural operators to align with PIM accelerators' specifications is an area that warrants further study. In this paper, we introduce the Epitome, a lightweight neural operator offering convolution-like functionality, to craft memory-efficient CNN operators for PIM accelerators (EPIM). On the software side, we evaluate epitomes' latency and energy on PIM accelerators and introduce a PIM-aware layer-wise design method to enhance their hardware efficiency. We apply epitome-aware quantization to further reduce the size of epitomes. On the hardware side, we modify the datapath of current PIM accelerators to accommodate epitomes and implement a feature map reuse technique to reduce computation cost. Experimental results reveal that our 3-bit quantized EPIM-ResNet50 attains 71.59% top-1 accuracy on ImageNet, reducing crossbar areas by 30.65 times. EPIM surpasses the state-of-the-art pruning methods on PIM.  ( 2 min )
    Language Model-In-The-Loop: Data Optimal Approach to Learn-To-Recommend Actions in Text Games. (arXiv:2311.07687v1 [cs.CL])
    Large Language Models (LLMs) have demonstrated superior performance in language understanding benchmarks. CALM, a popular approach, leverages linguistic priors of LLMs -- GPT-2 -- for action candidate recommendations to improve the performance in text games in Jericho without environment-provided actions. However, CALM adapts GPT-2 with annotated human gameplays and keeps the LLM fixed during the learning of the text based games. In this work, we explore and evaluate updating LLM used for candidate recommendation during the learning of the text based game as well to mitigate the reliance on the human annotated gameplays, which are costly to acquire. We observe that by updating the LLM during learning using carefully selected in-game transitions, we can reduce the dependency on using human annotated game plays for fine-tuning the LLMs. We conducted further analysis to study the transferability of the updated LLMs and observed that transferring in-game trained models to other games did not result in a consistent transfer.  ( 2 min )
    A Fast and Simple Algorithm for computing the MLE of Amplitude Density Function Parameters. (arXiv:2311.07951v1 [stat.ME])
    Over the last decades, the family of $\alpha$-stale distributions has proven to be useful for modelling in telecommunication systems. Particularly, in the case of radar applications, finding a fast and accurate estimation for the amplitude density function parameters appears to be very important. In this work, the maximum likelihood estimator (MLE) is proposed for parameters of the amplitude distribution. To do this, the amplitude data are \emph{projected} on the horizontal and vertical axes using two simple transformations. It is proved that the \emph{projected} data follow a zero-location symmetric $\alpha$-stale distribution for which the MLE can be computed quite fast. The average of computed MLEs based on two \emph{projections} is considered as estimator for parameters of the amplitude distribution. Performance of the proposed \emph{projection} method is demonstrated through simulation study and analysis of two sets of real radar data.
    Bayesian Conditional Diffusion Models for Versatile Spatiotemporal Turbulence Generation. (arXiv:2311.07896v1 [physics.flu-dyn])
    Turbulent flows have historically presented formidable challenges to predictive computational modeling. Traditional numerical simulations often require vast computational resources, making them infeasible for numerous engineering applications. As an alternative, deep learning-based surrogate models have emerged, offering data-drive solutions. However, these are typically constructed within deterministic settings, leading to shortfall in capturing the innate chaotic and stochastic behaviors of turbulent dynamics. We introduce a novel generative framework grounded in probabilistic diffusion models for versatile generation of spatiotemporal turbulence. Our method unifies both unconditional and conditional sampling strategies within a Bayesian framework, which can accommodate diverse conditioning scenarios, including those with a direct differentiable link between specified conditions and generated unsteady flow outcomes, and scenarios lacking such explicit correlations. A notable feature of our approach is the method proposed for long-span flow sequence generation, which is based on autoregressive gradient-based conditional sampling, eliminating the need for cumbersome retraining processes. We showcase the versatile turbulence generation capability of our framework through a suite of numerical experiments, including: 1) the synthesis of LES simulated instantaneous flow sequences from URANS inputs; 2) holistic generation of inhomogeneous, anisotropic wall-bounded turbulence, whether from given initial conditions, prescribed turbulence statistics, or entirely from scratch; 3) super-resolved generation of high-speed turbulent boundary layer flows from low-resolution data across a range of input resolutions. Collectively, our numerical experiments highlight the merit and transformative potential of the proposed methods, making a significant advance in the field of turbulence generation.  ( 2 min )
    Relative intrinsic dimensionality is intrinsic to learning. (arXiv:2311.07579v1 [cs.LG])
    High dimensional data can have a surprising property: pairs of data points may be easily separated from each other, or even from arbitrary subsets, with high probability using just simple linear classifiers. However, this is more of a rule of thumb than a reliable property as high dimensionality alone is neither necessary nor sufficient for successful learning. Here, we introduce a new notion of the intrinsic dimension of a data distribution, which precisely captures the separability properties of the data. For this intrinsic dimension, the rule of thumb above becomes a law: high intrinsic dimension guarantees highly separable data. We extend this notion to that of the relative intrinsic dimension of two data distributions, which we show provides both upper and lower bounds on the probability of successfully learning and generalising in a binary classification problem
    ResMGCN: Residual Message Graph Convolution Network for Fast Biomedical Interactions Discovering. (arXiv:2311.07632v1 [cs.LG])
    Biomedical information graphs are crucial for interaction discovering of biomedical information in modern age, such as identification of multifarious molecular interactions and drug discovery, which attracts increasing interests in biomedicine, bioinformatics, and human healthcare communities. Nowadays, more and more graph neural networks have been proposed to learn the entities of biomedical information and precisely reveal biomedical molecule interactions with state-of-the-art results. These methods remedy the fading of features from a far distance but suffer from remedying such problem at the expensive cost of redundant memory and time. In our paper, we propose a novel Residual Message Graph Convolution Network (ResMGCN) for fast and precise biomedical interaction prediction in a different idea. Specifically, instead of enhancing the message from far nodes, ResMGCN aggregates lower-order information with the next round higher information to guide the node update to obtain a more meaningful node representation. ResMGCN is able to perceive and preserve various messages from the previous layer and high-order information in the current layer with least memory and time cost to obtain informative representations of biomedical entities. We conduct experiments on four biomedical interaction network datasets, including protein-protein, drug-drug, drug-target, and gene-disease interactions, which demonstrates that ResMGCN outperforms previous state-of-the-art models while achieving superb effectiveness on both storage and time.  ( 2 min )
    Finding Inductive Loop Invariants using Large Language Models. (arXiv:2311.07948v1 [cs.PL])
    Loop invariants are fundamental to reasoning about programs with loops. They establish properties about a given loop's behavior. When they additionally are inductive, they become useful for the task of formal verification that seeks to establish strong mathematical guarantees about program's runtime behavior. The inductiveness ensures that the invariants can be checked locally without consulting the entire program, thus are indispensable artifacts in a formal proof of correctness. Finding inductive loop invariants is an undecidable problem, and despite a long history of research towards practical solutions, it remains far from a solved problem. This paper investigates the capabilities of the Large Language Models (LLMs) in offering a new solution towards this old, yet important problem. To that end, we first curate a dataset of verification problems on programs with loops. Next, we design a prompt for exploiting LLMs, obtaining inductive loop invariants, that are checked for correctness using sound symbolic tools. Finally, we explore the effectiveness of using an efficient combination of a symbolic tool and an LLM on our dataset and compare it against a purely symbolic baseline. Our results demonstrate that LLMs can help improve the state-of-the-art in automated program verification.
    Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?. (arXiv:2311.07587v1 [cs.CL])
    We introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment. This problem is comprised of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete. Even in the simple setting of 1-digit addition problems, it is easy to find adversarial prompts that make all tested models (including PaLM2, GPT4, Claude2) misbehave, and even to steer models to a particular wrong answer. We additionally provide a simple algorithm for finding successful attacks by querying those same models, which we name "prompt inversion rejection sampling" (PIRS). We finally show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops. However, we were not able to make a language model fully robust against adversarial arithmetic attacks.
    Adversarial Preference Optimization. (arXiv:2311.08045v1 [cs.CL])
    Human preference alignment is a crucial training step to improve the interaction quality of large language models (LLMs). Existing aligning methods depend on manually annotated preference data to guide the LLM optimization directions. However, in practice, continuously updating LLMs raises a distribution gap between model-generated samples and human-preferred responses, which hinders model fine-tuning efficiency. To mitigate this issue, previous methods require additional preference annotation on generated samples to adapt the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an adversarial preference optimization (APO) framework, where the LLM agent and the preference model update alternatively via a min-max game. Without additional annotation, our APO method can make a self-adaption to the generation distribution gap through the adversarial learning process. In experiments, we empirically verify the effectiveness of APO in improving LLM's helpfulness and harmlessness compared with rejection sampling baselines.
    Performance Prediction of Data-Driven Knowledge summarization of High Entropy Alloys (HEAs) literature implementing Natural Language Processing algorithms. (arXiv:2311.07584v1 [cs.CL])
    The ability to interpret spoken language is connected to natural language processing. It involves teaching the AI how words relate to one another, how they are meant to be used, and in what settings. The goal of natural language processing (NLP) is to get a machine intelligence to process words the same way a human brain does. This enables machine intelligence to interpret, arrange, and comprehend textual data by processing the natural language. The technology can comprehend what is communicated, whether it be through speech or writing because AI pro-cesses language more quickly than humans can. In the present study, five NLP algorithms, namely, Geneism, Sumy, Luhn, Latent Semantic Analysis (LSA), and Kull-back-Liebler (KL) al-gorithm, are implemented for the first time for the knowledge summarization purpose of the High Entropy Alloys (HEAs). The performance prediction of these algorithms is made by using the BLEU score and ROUGE score. The results showed that the Luhn algorithm has the highest accuracy score for the knowledge summarization tasks compared to the other used algorithms.  ( 2 min )
    DiLoCo: Distributed Low-Communication Training of Language Models. (arXiv:2311.08105v1 [cs.LG])
    Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of tightly interconnected accelerators, with devices exchanging gradients and other intermediate states at each optimization step. While it is difficult to build and maintain a single computing cluster hosting many accelerators, it might be easier to find several computing clusters each hosting a smaller number of devices. In this work, we propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of language models on islands of devices that are poorly connected. The approach is a variant of federated averaging, where the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum. On the widely used C4 dataset, we show that DiLoCo on 8 workers performs as well as fully synchronous optimization while communicating 500 times less. DiLoCo exhibits great robustness to the data distribution of each worker. It is also robust to resources becoming unavailable over time, and vice versa, it can seamlessly leverage resources that become available during training.
    Finetuning Text-to-Image Diffusion Models for Fairness. (arXiv:2311.07604v1 [cs.LG])
    The rapid adoption of text-to-image diffusion models in society underscores an urgent need to address their biases. Without interventions, these biases could propagate a distorted worldview and limit opportunities for minority groups. In this work, we frame fairness as a distributional alignment problem. Our solution consists of two main technical contributions: (1) a distributional alignment loss that steers specific characteristics of the generated images towards a user-defined target distribution, and (2) biased direct finetuning of diffusion model's sampling process, which leverages a biased gradient to more effectively optimize losses defined on the generated images. Empirically, our method markedly reduces gender, racial, and their intersectional biases for occupational prompts. Gender bias is significantly reduced even when finetuning just five soft tokens. Crucially, our method supports diverse perspectives of fairness beyond absolute equality, which is demonstrated by controlling age to a $75\%$ young and $25\%$ old distribution while simultaneously debiasing gender and race. Finally, our method is scalable: it can debias multiple concepts at once by simply including these prompts in the finetuning data. We hope our work facilitates the social alignment of T2I generative AI. We will share code and various debiased diffusion model adaptors.  ( 2 min )
    Quantifying Credit Portfolio sensitivity to asset correlations with interpretable generative neural networks. (arXiv:2309.08652v2 [q-fin.RM] UPDATED)
    In this research, we propose a novel approach for the quantification of credit portfolio Value-at-Risk (VaR) sensitivity to asset correlations with the use of synthetic financial correlation matrices generated with deep learning models. In previous work Generative Adversarial Networks (GANs) were employed to demonstrate the generation of plausible correlation matrices, that capture the essential characteristics observed in empirical correlation matrices estimated on asset returns. Instead of GANs, we employ Variational Autoencoders (VAE) to achieve a more interpretable latent space representation. Through our analysis, we reveal that the VAE latent space can be a useful tool to capture the crucial factors impacting portfolio diversification, particularly in relation to credit portfolio sensitivity to asset correlations changes.
    A Physics-informed Machine Learning-based Control Method for Nonlinear Dynamic Systems with Highly Noisy Measurements. (arXiv:2311.07613v1 [eess.SY])
    This study presents a physics-informed machine learning-based control method for nonlinear dynamic systems with highly noisy measurements. Existing data-driven control methods that use machine learning for system identification cannot effectively cope with highly noisy measurements, resulting in unstable control performance. To address this challenge, the present study extends current physics-informed machine learning capabilities for modeling nonlinear dynamics with control and integrates them into a model predictive control framework. To demonstrate the capability of the proposed method we test and validate with two noisy nonlinear dynamic systems: the chaotic Lorenz 3 system, and turning machine tool. Analysis of the results illustrate that the proposed method outperforms state-of-the-art benchmarks as measured by both modeling accuracy and control performance for nonlinear dynamic systems under high-noise conditions.  ( 2 min )
    A Decision Support System for Liver Diseases Prediction: Integrating Batch Processing, Rule-Based Event Detection and SPARQL Query. (arXiv:2311.07595v1 [cs.AI])
    Liver diseases pose a significant global health burden, impacting a substantial number of individuals and exerting substantial economic and social consequences. Rising liver problems are considered a fatal disease in many countries, such as Egypt, Molda, etc. The objective of this study is to construct a predictive model for liver illness using Basic Formal Ontology (BFO) and detection rules derived from a decision tree algorithm. Based on these rules, events are detected through batch processing using the Apache Jena framework. Based on the event detected, queries can be directly processed using SPARQL. To make the ontology operational, these Decision Tree (DT) rules are converted into Semantic Web Rule Language (SWRL). Using this SWRL in the ontology for predicting different types of liver disease with the help of the Pellet and Drool inference engines in Protege Tools, a total of 615 records are taken from different liver diseases. After inferring the rules, the result can be generated for the patient according to the DT rules, and other patient-related details along with different precautionary suggestions can be obtained based on these results. Combining query results of batch processing and ontology-generated results can give more accurate suggestions for disease prevention and detection. This work aims to provide a comprehensive approach that is applicable for liver disease prediction, rich knowledge graph representation, and smart querying capabilities. The results show that combining RDF data, SWRL rules, and SPARQL queries for analysing and predicting liver disease can help medical professionals to learn more about liver diseases and make a Decision Support System (DSS) for health care.  ( 3 min )
    Quantum Machine Learning for Remote Sensing: Exploring potential and challenges. (arXiv:2311.07626v1 [quant-ph])
    The industry of quantum technologies is rapidly expanding, offering promising opportunities for various scientific domains. Among these emerging technologies, Quantum Machine Learning (QML) has attracted considerable attention due to its potential to revolutionize data processing and analysis. In this paper, we investigate the application of QML in the field of remote sensing. It is believed that QML can provide valuable insights for analysis of data from space. We delve into the common beliefs surrounding the quantum advantage in QML for remote sensing and highlight the open challenges that need to be addressed. To shed light on the challenges, we conduct a study focused on the problem of kernel value concentration, a phenomenon that adversely affects the runtime of quantum computers. Our findings indicate that while this issue negatively impacts quantum computer performance, it does not entirely negate the potential quantum advantage in QML for remote sensing.  ( 2 min )
    PolyIE: A Dataset of Information Extraction from Polymer Material Scientific Literature. (arXiv:2311.07715v1 [cs.CL])
    Scientific information extraction (SciIE), which aims to automatically extract information from scientific literature, is becoming more important than ever. However, there are no existing SciIE datasets for polymer materials, which is an important class of materials used ubiquitously in our daily lives. To bridge this gap, we introduce POLYIE, a new SciIE dataset for polymer materials. POLYIE is curated from 146 full-length polymer scholarly articles, which are annotated with different named entities (i.e., materials, properties, values, conditions) as well as their N-ary relations by domain experts. POLYIE presents several unique challenges due to diverse lexical formats of entities, ambiguity between entities, and variable-length relations. We evaluate state-of-the-art named entity extraction and relation extraction models on POLYIE, analyze their strengths and weaknesses, and highlight some difficult cases for these models. To the best of our knowledge, POLYIE is the first SciIE benchmark for polymer materials, and we hope it will lead to more research efforts from the community on this challenging task. Our code and data are available on: https://github.com/jerry3027/PolyIE.  ( 2 min )
    Large Language Models' Understanding of Math: Source Criticism and Extrapolation. (arXiv:2311.07618v1 [cs.LG])
    It has been suggested that large language models such as GPT-4 have acquired some form of understanding beyond the correlations among the words in text including some understanding of mathematics as well. Here, we perform a critical inquiry into this claim by evaluating the mathematical understanding of the GPT-4 model. Considering that GPT-4's training set is a secret, it is not straightforward to evaluate whether the model's correct answers are based on a mathematical understanding or based on replication of proofs that the model has seen before. We specifically craft mathematical questions which their formal proofs are not readily available on the web, proofs that are more likely not seen by the GPT-4. We see that GPT-4 is unable to solve those problems despite their simplicity. It is hard to find scientific evidence suggesting that GPT-4 has acquired an understanding of even basic mathematical concepts. A straightforward way to find failure modes of GPT-4 in theorem proving is to craft questions where their formal proofs are not available on the web. Our finding suggests that GPT-4's ability is to reproduce, rephrase, and polish the mathematical proofs that it has seen before, and not in grasping mathematical concepts. We also see that GPT-4's ability to prove mathematical theorems is continuously expanding over time despite the claim that it is a fixed model. We suggest that the task of proving mathematical theorems in formal language is comparable to the methods used in search engines such as Google while predicting the next word in a sentence may be a misguided approach, a recipe that often leads to excessive extrapolation and eventual failures. Prompting the GPT-4 over and over may benefit the GPT-4 and the OpenAI, but we question whether it is valuable for machine learning or for theorem proving.  ( 3 min )
    Graph GOSPA metric: a metric to measure the discrepancy between graphs of different sizes. (arXiv:2311.07596v1 [cs.SI])
    This paper proposes a metric to measure the dissimilarity between graphs that may have a different number of nodes. The proposed metric extends the generalised optimal subpattern assignment (GOSPA) metric, which is a metric for sets, to graphs. The proposed graph GOSPA metric includes costs associated with node attribute errors for properly assigned nodes, missed and false nodes and edge mismatches between graphs. The computation of this metric is based on finding the optimal assignments between nodes in the two graphs, with the possibility of leaving some of the nodes unassigned. We also propose a lower bound for the metric, which is also a metric for graphs and is computable in polynomial time using linear programming. The metric is first derived for undirected unweighted graphs and it is then extended to directed and weighted graphs. The properties of the metric are demonstrated via simulated and empirical datasets.  ( 2 min )
    Reinforcement Learning for Solving Stochastic Vehicle Routing Problem. (arXiv:2311.07708v1 [cs.AI])
    This study addresses a gap in the utilization of Reinforcement Learning (RL) and Machine Learning (ML) techniques in solving the Stochastic Vehicle Routing Problem (SVRP) that involves the challenging task of optimizing vehicle routes under uncertain conditions. We propose a novel end-to-end framework that comprehensively addresses the key sources of stochasticity in SVRP and utilizes an RL agent with a simple yet effective architecture and a tailored training method. Through comparative analysis, our proposed model demonstrates superior performance compared to a widely adopted state-of-the-art metaheuristic, achieving a significant 3.43% reduction in travel costs. Furthermore, the model exhibits robustness across diverse SVRP settings, highlighting its adaptability and ability to learn optimal routing strategies in varying environments. The publicly available implementation of our framework serves as a valuable resource for future research endeavors aimed at advancing RL-based solutions for SVRP.  ( 2 min )
    Matching aggregate posteriors in the variational autoencoder. (arXiv:2311.07693v1 [cs.LG])
    The variational autoencoder (VAE) is a well-studied, deep, latent-variable model (DLVM) that efficiently optimizes the variational lower bound of the log marginal data likelihood and has a strong theoretical foundation. However, the VAE's known failure to match the aggregate posterior often results in \emph{pockets/holes} in the latent distribution (i.e., a failure to match the prior) and/or \emph{posterior collapse}, which is associated with a loss of information in the latent space. This paper addresses these shortcomings in VAEs by reformulating the objective function associated with VAEs in order to match the aggregate/marginal posterior distribution to the prior. We use kernel density estimate (KDE) to model the aggregate posterior in high dimensions. The proposed method is named the \emph{aggregate variational autoencoder} (AVAE) and is built on the theoretical framework of the VAE. Empirical evaluation of the proposed method on multiple benchmark data sets demonstrates the effectiveness of the AVAE relative to state-of-the-art (SOTA) methods.  ( 2 min )
    Identification of Books That are Suitable for Middle School Students Using Artificial Neural Networks. (arXiv:2311.07591v1 [cs.CL])
    Reading right books contributes to children's imagination and brain development, enhances their language and emotional comprehension abilities, and strengthens their relationships with others. Building upon the critical role of reading books in individual development, this paper aims to develop an algorithm that determines the suitability of books for middle school students by analyzing their structural and semantic features. Using methods described, an algorithm will be created that can be utilized by institutions and individuals responsible for children's education, such as the Ministry of National Education officials and schools. This algorithm will facilitate the selection of books to be taught at the middle school level. With the algorithm, the book selection process for the middle school curriculum can be expedited, and it will serve as a preliminary reference source for those who evaluate books by reading them. In this paper, the Python programming language was employed, utilizing natural language processing methods. Additionally, an artificial neural network (ANN) was trained using the data which had been preprocessed to construct an original dataset. To train this network, suitable books for middle school students were provided by the MEB, Oxford and Cambridge and with content assessed based on the "R" criterion, and inappropriate books for middle school students in terms of content were included. This trained neural network achieved a 90.06% consistency rate in determining the appropriateness of the test-provided books. Considering the obtained findings, it can be concluded that the developed software has achieved the desired objective.  ( 3 min )
    The Disagreement Problem in Faithfulness Metrics. (arXiv:2311.07763v1 [cs.LG])
    The field of explainable artificial intelligence (XAI) aims to explain how black-box machine learning models work. Much of the work centers around the holy grail of providing post-hoc feature attributions to any model architecture. While the pace of innovation around novel methods has slowed down, the question remains of how to choose a method, and how to make it fit for purpose. Recently, efforts around benchmarking XAI methods have suggested metrics for that purpose -- but there are many choices. That bounty of choice still leaves an end user unclear on how to proceed. This paper focuses on comparing metrics with the aim of measuring faithfulness of local explanations on tabular classification problems -- and shows that the current metrics don't agree; leaving users unsure how to choose the most faithful explanations.  ( 2 min )
    Application of a Dense Fusion Attention Network in Fault Diagnosis of Centrifugal Fan. (arXiv:2311.07614v1 [cs.LG])
    Although the deep learning recognition model has been widely used in the condition monitoring of rotating machinery. However, it is still a challenge to understand the correspondence between the structure and function of the model and the diagnosis process. Therefore, this paper discusses embedding distributed attention modules into dense connections instead of traditional dense cascading operations. It not only decouples the influence of space and channel on fault feature adaptive recalibration feature weights, but also forms a fusion attention function. The proposed dense fusion focuses on the visualization of the network diagnosis process, which increases the interpretability of model diagnosis. How to continuously and effectively integrate different functions to enhance the ability to extract fault features and the ability to resist noise is answered. Centrifugal fan fault data is used to verify this network. Experimental results show that the network has stronger diagnostic performance than other advanced fault diagnostic models.  ( 2 min )
  • Open

    Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes. (arXiv:2212.05949v3 [stat.ML] UPDATED)
    Despite the significant interest and progress in reinforcement learning (RL) problems with adversarial corruption, current works are either confined to the linear setting or lead to an undesired $\tilde{O}(\sqrt{T}\zeta)$ regret bound, where $T$ is the number of rounds and $\zeta$ is the total amount of corruption. In this paper, we consider the contextual bandit with general function approximation and propose a computationally efficient algorithm to achieve a regret of $\tilde{O}(\sqrt{T}+\zeta)$. The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandit and a new weighted estimator of uncertainty for the general function class. In contrast to the existing analysis that heavily relies on the linear structure, we develop a novel technique to control the sum of weighted uncertainty, thus establishing the final regret bounds. We then generalize our algorithm to the episodic MDP setting and first achieve an additive dependence on the corruption level $\zeta$ in the scenario of general function approximation. Notably, our algorithms achieve regret bounds either nearly match the performance lower bound or improve the existing methods for all the corruption levels and in both known and unknown $\zeta$ cases.  ( 2 min )
    A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation. (arXiv:2106.06854v2 [cs.LG] UPDATED)
    Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
    Iterative missing value imputation based on feature importance. (arXiv:2311.08005v1 [cs.LG])
    Many datasets suffer from missing values due to various reasons,which not only increases the processing difficulty of related tasks but also reduces the accuracy of classification. To address this problem, the mainstream approach is to use missing value imputation to complete the dataset. Existing imputation methods estimate the missing parts based on the observed values in the original feature space, and they treat all features as equally important during data completion, while in fact different features have different importance. Therefore, we have designed an imputation method that considers feature importance. This algorithm iteratively performs matrix completion and feature importance learning, and specifically, matrix completion is based on a filling loss that incorporates feature importance. Our experimental analysis involves three types of datasets: synthetic datasets with different noisy features and missing values, real-world datasets with artificially generated missing values, and real-world datasets originally containing missing values. The results on these datasets consistently show that the proposed method outperforms the existing five imputation algorithms.To the best of our knowledge, this is the first work that considers feature importance in the imputation model.
    DynaConF: Dynamic Forecasting of Non-Stationary Time-Series. (arXiv:2209.08411v2 [cs.LG] UPDATED)
    Deep learning has shown impressive results in a variety of time series forecasting tasks, where modeling the conditional distribution of the future given the past is the essence. However, when this conditional distribution is non-stationary, it poses challenges for these models to learn consistently and to predict accurately. In this work, we propose a new method to model non-stationary conditional distributions over time by clearly decoupling stationary conditional distribution modeling from non-stationary dynamics modeling. Our method is based on a Bayesian dynamic model that can adapt to conditional distribution changes and a deep conditional distribution model that handles multivariate time series using a factorized output space. Our experimental results on synthetic and real-world datasets show that our model can adapt to non-stationary time series better than state-of-the-art deep learning solutions.
    Unbiased Learning of Deep Generative Models with Structured Discrete Representations. (arXiv:2306.08230v2 [cs.LG] UPDATED)
    By composing graphical models with deep learning architectures, we learn generative models with the strengths of both frameworks. The structured variational autoencoder (SVAE) inherits structure and interpretability from graphical models, and flexible likelihoods for high-dimensional data from deep learning, but poses substantial optimization challenges. We propose novel algorithms for learning SVAEs, and are the first to demonstrate the SVAE's ability to handle multimodal uncertainty when data is missing by incorporating discrete latent variables. Our memory-efficient implicit differentiation scheme makes the SVAE tractable to learn via gradient descent, while demonstrating robustness to incomplete optimization. To more rapidly learn accurate graphical model parameters, we derive a method for computing natural gradients without manual derivations, which avoids biases found in prior work. These optimization innovations enable the first comparisons of the SVAE to state-of-the-art time series models, where the SVAE performs competitively while learning interpretable and structured discrete data representations.
    Axiomatization of Interventional Probability Distributions. (arXiv:2305.04479v2 [math.ST] UPDATED)
    Causal intervention is an essential tool in causal inference. It is axiomatized under the rules of do-calculus in the case of structure causal models. We provide simple axiomatizations for families of probability distributions to be different types of interventional distributions. Our axiomatizations neatly lead to a simple and clear theory of causality that has several advantages: it does not need to make use of any modeling assumptions such as those imposed by structural causal models; it only relies on interventions on single variables; it includes most cases with latent variables and causal cycles; and more importantly, it does not assume the existence of an underlying true causal graph as we do not take it as the primitive object--in fact, a causal graph is derived as a by-product of our theory. We show that, under our axiomatizations, the intervened distributions are Markovian to the defined intervened causal graphs, and an observed joint probability distribution is Markovian to the obtained causal graph; these results are consistent with the case of structural causal models, and as a result, the existing theory of causal inference applies. We also show that a large class of natural structural causal models satisfy the theory presented here. We note that the aim of this paper is axiomatization of interventional families, which is subtly different from "causal modeling."
    The convergence of the Stochastic Gradient Descent (SGD) : a self-contained proof. (arXiv:2103.14350v2 [stat.ML] UPDATED)
    We give here a proof of the convergence of the Stochastic Gradient Descent (SGD) in a self-contained manner.
    Mixture of Coupled HMMs for Robust Modeling of Multivariate Healthcare Time Series. (arXiv:2311.07867v1 [cs.LG])
    Analysis of multivariate healthcare time series data is inherently challenging: irregular sampling, noisy and missing values, and heterogeneous patient groups with different dynamics violating exchangeability. In addition, interpretability and quantification of uncertainty are critically important. Here, we propose a novel class of models, a mixture of coupled hidden Markov models (M-CHMM), and demonstrate how it elegantly overcomes these challenges. To make the model learning feasible, we derive two algorithms to sample the sequences of the latent variables in the CHMM: samplers based on (i) particle filtering and (ii) factorized approximation. Compared to existing inference methods, our algorithms are computationally tractable, improve mixing, and allow for likelihood estimation, which is necessary to learn the mixture model. Experiments on challenging real-world epidemiological and semi-synthetic data demonstrate the advantages of the M-CHMM: improved data fit, capacity to efficiently handle missing and noisy measurements, improved prediction accuracy, and ability to identify interpretable subsets in the data.
    Generalized partitioned local depth. (arXiv:2303.10167v4 [stat.ML] UPDATED)
    In this paper we provide a generalization of the concept of cohesion as introduced recently by Berenhaut, Moore and Melvin [Proceedings of the National Academy of Sciences, 119 (4) (2022)]. The formulation presented builds on the technique of partitioned local depth by distilling two key probabilistic concepts: local relevance and support division. Earlier results are extended within the new context, and examples of applications to revealing communities in data with uncertainty are included. The work sheds light on the foundations of partitioned local depth, and extends the original ideas to enable probabilistic consideration of uncertain, variable and potentially conflicting information.
    Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications. (arXiv:2302.03683v2 [cs.LG] UPDATED)
    Partial monitoring is an expressive framework for sequential decision-making with an abundance of applications, including graph-structured and dueling bandits, dynamic pricing and transductive feedback models. We survey and extend recent results on the linear formulation of partial monitoring that naturally generalizes the standard linear bandit setting. The main result is that a single algorithm, information-directed sampling (IDS), is (nearly) worst-case rate optimal in all finite-action games. We present a simple and unified analysis of stochastic partial monitoring, and further extend the model to the contextual and kernelized setting.  ( 2 min )
    A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity. (arXiv:2302.06015v3 [cs.LG] UPDATED)
    Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Due to non-convex interactions across layers, however, theoretical learning and generalization analysis is mostly elusive. Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity to achieve a zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. We also prove that a training process using stochastic gradient descent (SGD) leads to a sparse attention map, which is a formal verification of the general intuition about the success of attention. Moreover, this paper indicates that a proper token sparsification can improve the test performance by removing label-irrelevant and/or noisy tokens, including spurious correlations. Empirical experiments on synthetic data and CIFAR-10 dataset justify our theoretical results and generalize to deeper ViTs.  ( 2 min )
    Low-rank variational Bayes correction to the Laplace method. (arXiv:2111.12945v2 [stat.ME] UPDATED)
    Approximate inference methods like the Laplace method, Laplace approximations and variational methods, amongst others, are popular methods when exact inference is not feasible due to the complexity of the model or the abundance of data. In this paper we propose a hybrid approximate method called Low-Rank Variational Bayes correction (VBC), that uses the Laplace method and subsequently a Variational Bayes correction in a lower dimension, to the joint posterior mean. The cost is essentially that of the Laplace method which ensures scalability of the method, in both model complexity and data size. Models with fixed and unknown hyperparameters are considered, for simulated and real examples, for small and large datasets.  ( 2 min )
    Transformers can optimally learn regression mixture models. (arXiv:2311.08362v1 [cs.LG])
    Mixture models arise in many regression problems, but most methods have seen limited adoption partly due to these algorithms' highly-tailored and model-specific nature. On the other hand, transformers are flexible, neural sequence models that present the intriguing possibility of providing general-purpose prediction methods, even in this mixture setting. In this work, we investigate the hypothesis that transformers can learn an optimal predictor for mixtures of regressions. We construct a generative process for a mixture of linear regressions for which the decision-theoretic optimal procedure is given by data-driven exponential weights on a finite set of parameters. We observe that transformers achieve low mean-squared error on data generated via this process. By probing the transformer's output at inference time, we also show that transformers typically make predictions that are close to the optimal predictor. Our experiments also demonstrate that transformers can learn mixtures of regressions in a sample-efficient fashion and are somewhat robust to distribution shifts. We complement our experimental observations by proving constructively that the decision-theoretic optimal procedure is indeed implementable by a transformer.  ( 2 min )
    High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise. (arXiv:2310.18784v2 [cs.LG] UPDATED)
    Several recent works have studied the convergence \textit{in high probability} of stochastic gradient descent (SGD) and its clipped variant. Compared to vanilla SGD, clipped SGD is practically more stable and has the additional theoretical benefit of logarithmic dependence on the failure probability. However, the convergence of other practical nonlinear variants of SGD, e.g., sign SGD, quantized SGD and normalized SGD, that achieve improved communication efficiency or accelerated convergence is much less understood. In this work, we study the convergence bounds \textit{in high probability} of a broad class of nonlinear SGD methods. For strongly convex loss functions with Lipschitz continuous gradients, we prove a logarithmic dependence on the failure probability, even when the noise is heavy-tailed. Strictly more general than the results for clipped SGD, our results hold for any nonlinearity with bounded (component-wise or joint) outputs, such as clipping, normalization, and quantization. Further, existing results with heavy-tailed noise assume bounded $\eta$-th central moments, with $\eta \in (1,2]$. In contrast, our refined analysis works even for $\eta=1$, strictly relaxing the noise moment assumptions in the literature.  ( 2 min )
    Visualizing the Diversity of Representations Learned by Bayesian Neural Networks. (arXiv:2201.10859v2 [cs.LG] UPDATED)
    Explainable Artificial Intelligence (XAI) aims to make learning machines less opaque, and offers researchers and practitioners various tools to reveal the decision-making strategies of neural networks. In this work, we investigate how XAI methods can be used for exploring and visualizing the diversity of feature representations learned by Bayesian Neural Networks (BNNs). Our goal is to provide a global understanding of BNNs by making their decision-making strategies a) visible and tangible through feature visualizations and b) quantitatively measurable with a distance measure learned by contrastive learning. Our work provides new insights into the \emph{posterior} distribution in terms of human-understandable feature information with regard to the underlying decision making strategies. The main findings of our work are the following: 1) global XAI methods can be applied to explain the diversity of decision-making strategies of BNN instances, 2) Monte Carlo dropout with commonly used Dropout rates exhibit increased diversity in feature representations compared to the multimodal posterior approximation of MultiSWAG, 3) the diversity of learned feature representations highly correlates with the uncertainty estimate for the output and 4) the inter-mode diversity of the multimodal posterior decreases as the network width increases, while the intra mode diversity increases. These findings are consistent with the recent Deep Neural Networks theory, providing additional intuitions about what the theory implies in terms of humanly understandable concepts.  ( 3 min )
    PAC-Bayesian Generalization Bounds for Adversarial Generative Models. (arXiv:2302.08942v4 [cs.LG] UPDATED)
    We extend PAC-Bayesian theory to generative models and develop generalization bounds for models based on the Wasserstein distance and the total variation distance. Our first result on the Wasserstein distance assumes the instance space is bounded, while our second result takes advantage of dimensionality reduction. Our results naturally apply to Wasserstein GANs and Energy-Based GANs, and our bounds provide new training objectives for these two. Although our work is mainly theoretical, we perform numerical experiments showing non-vacuous generalization bounds for Wasserstein GANs on synthetic datasets.  ( 2 min )
    Time-Uniform Confidence Spheres for Means of Random Vectors. (arXiv:2311.08168v1 [math.ST])
    We derive and study time-uniform confidence spheres - termed confidence sphere sequences (CSSs) - which contain the mean of random vectors with high probability simultaneously across all sample sizes. Inspired by the original work of Catoni and Giulini, we unify and extend their analysis to cover both the sequential setting and to handle a variety of distributional assumptions. More concretely, our results include an empirical-Bernstein CSS for bounded random vectors (resulting in a novel empirical-Bernstein confidence interval), a CSS for sub-$\psi$ random vectors, and a CSS for heavy-tailed random vectors based on a sequentially valid Catoni-Giulini estimator. Finally, we provide a version of our empirical-Bernstein CSS that is robust to contamination by Huber noise.  ( 2 min )
    Causal Message Passing: A Method for Experiments with Unknown and General Network Interference. (arXiv:2311.08340v1 [stat.ME])
    Randomized experiments are a powerful methodology for data-driven evaluation of decisions or interventions. Yet, their validity may be undermined by network interference. This occurs when the treatment of one unit impacts not only its outcome but also that of connected units, biasing traditional treatment effect estimations. Our study introduces a new framework to accommodate complex and unknown network interference, moving beyond specialized models in the existing literature. Our framework, which we term causal message-passing, is grounded in a high-dimensional approximate message passing methodology and is specifically tailored to experimental design settings with prevalent network interference. Utilizing causal message-passing, we present a practical algorithm for estimating the total treatment effect and demonstrate its efficacy in four numerical scenarios, each with its unique interference structure.  ( 2 min )
    Introducing an Improved Information-Theoretic Measure of Predictive Uncertainty. (arXiv:2311.08309v1 [cs.LG])
    Applying a machine learning model for decision-making in the real world requires to distinguish what the model knows from what it does not. A critical factor in assessing the knowledge of a model is to quantify its predictive uncertainty. Predictive uncertainty is commonly measured by the entropy of the Bayesian model average (BMA) predictive distribution. Yet, the properness of this current measure of predictive uncertainty was recently questioned. We provide new insights regarding those limitations. Our analyses show that the current measure erroneously assumes that the BMA predictive distribution is equivalent to the predictive distribution of the true model that generated the dataset. Consequently, we introduce a theoretically grounded measure to overcome these limitations. We experimentally verify the benefits of our introduced measure of predictive uncertainty. We find that our introduced measure behaves more reasonably in controlled synthetic tasks. Moreover, our evaluations on ImageNet demonstrate that our introduced measure is advantageous in real-world applications utilizing predictive uncertainty.  ( 2 min )
    Learning Adversarial Low-rank Markov Decision Processes with Unknown Transition and Full-information Feedback. (arXiv:2311.07876v1 [cs.LG])
    In this work, we study the low-rank MDPs with adversarially changed losses in the full-information feedback setting. In particular, the unknown transition probability kernel admits a low-rank matrix decomposition \citep{REPUCB22}, and the loss functions may change adversarially but are revealed to the learner at the end of each episode. We propose a policy optimization-based algorithm POLO, and we prove that it attains the $\widetilde{O}(K^{\frac{5}{6}}A^{\frac{1}{2}}d\ln(1+M)/(1-\gamma)^2)$ regret guarantee, where $d$ is rank of the transition kernel (and hence the dimension of the unknown representations), $A$ is the cardinality of the action space, $M$ is the cardinality of the model class, and $\gamma$ is the discounted factor. Notably, our algorithm is oracle-efficient and has a regret guarantee with no dependence on the size of potentially arbitrarily large state space. Furthermore, we also prove an $\Omega(\frac{\gamma^2}{1-\gamma} \sqrt{d A K})$ regret lower bound for this problem, showing that low-rank MDPs are statistically more difficult to learn than linear MDPs in the regret minimization setting. To the best of our knowledge, we present the first algorithm that interleaves representation learning, exploration, and exploitation to achieve the sublinear regret guarantee for RL with nonlinear function approximation and adversarial losses.  ( 2 min )
    Modeling Complex Disease Trajectories using Deep Generative Models with Semi-Supervised Latent Processes. (arXiv:2311.08149v1 [cs.LG])
    In this paper, we propose a deep generative time series approach using latent temporal processes for modeling and holistically analyzing complex disease trajectories. We aim to find meaningful temporal latent representations of an underlying generative process that explain the observed disease trajectories in an interpretable and comprehensive way. To enhance the interpretability of these latent temporal processes, we develop a semi-supervised approach for disentangling the latent space using established medical concepts. By combining the generative approach with medical knowledge, we leverage the ability to discover novel aspects of the disease while integrating medical concepts into the model. We show that the learned temporal latent processes can be utilized for further data analysis and clinical hypothesis testing, including finding similar patients and clustering the disease into new sub-types. Moreover, our method enables personalized online monitoring and prediction of multivariate time series including uncertainty quantification. We demonstrate the effectiveness of our approach in modeling systemic sclerosis, showcasing the potential of our machine learning model to capture complex disease trajectories and acquire new medical knowledge.  ( 2 min )
    Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees. (arXiv:2311.08384v1 [cs.LG])
    Hybrid RL is the setting where an RL agent has access to both offline data and online data by interacting with the real-world environment. In this work, we propose a new hybrid RL algorithm that combines an on-policy actor-critic method with offline data. On-policy methods such as policy gradient and natural policy gradient (NPG) have shown to be more robust to model misspecification, though sometimes it may not be as sample efficient as methods that rely on off-policy learning. On the other hand, offline methods that depend on off-policy training often require strong assumptions in theory and are less stable to train in practice. Our new approach integrates a procedure of off-policy training on the offline data into an on-policy NPG framework. We show that our approach, in theory, can obtain a best-of-both-worlds type of result -- it achieves the state-of-art theoretical guarantees of offline RL when offline RL-specific assumptions hold, while at the same time maintaining the theoretical guarantees of on-policy NPG regardless of the offline RL assumptions' validity. Experimentally, in challenging rich-observation environments, we show that our approach outperforms a state-of-the-art hybrid RL baseline which only relies on off-policy policy optimization, demonstrating the empirical benefit of combining on-policy and off-policy learning. Our code is publicly available at https://github.com/YifeiZhou02/HNPG.  ( 2 min )
    Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data. (arXiv:2311.07762v1 [stat.ME])
    A mixture of multivariate Poisson-log normal factor analyzers is introduced by imposing constraints on the covariance matrix, which resulted in flexible models for clustering purposes. In particular, a class of eight parsimonious mixture models based on the mixtures of factor analyzers model are introduced. Variational Gaussian approximation is used for parameter estimation, and information criteria are used for model selection. The proposed models are explored in the context of clustering discrete data arising from RNA sequencing studies. Using real and simulated data, the models are shown to give favourable clustering performance. The GitHub R package for this work is available at https://github.com/anjalisilva/mixMPLNFA and is released under the open-source MIT license.  ( 2 min )
    CSLP-AE: A Contrastive Split-Latent Permutation Autoencoder Framework for Zero-Shot Electroencephalography Signal Conversion. (arXiv:2311.07788v1 [cs.LG])
    Electroencephalography (EEG) is a prominent non-invasive neuroimaging technique providing insights into brain function. Unfortunately, EEG data exhibit a high degree of noise and variability across subjects hampering generalizable signal extraction. Therefore, a key aim in EEG analysis is to extract the underlying neural activation (content) as well as to account for the individual subject variability (style). We hypothesize that the ability to convert EEG signals between tasks and subjects requires the extraction of latent representations accounting for content and style. Inspired by recent advancements in voice conversion technologies, we propose a novel contrastive split-latent permutation autoencoder (CSLP-AE) framework that directly optimizes for EEG conversion. Importantly, the latent representations are guided using contrastive learning to promote the latent splits to explicitly represent subject (style) and task (content). We contrast CSLP-AE to conventional supervised, unsupervised (AE), and self-supervised (contrastive learning) training and find that the proposed approach provides favorable generalizable characterizations of subject and task. Importantly, the procedure also enables zero-shot conversion between unseen subjects. While the present work only considers conversion of EEG, the proposed CSLP-AE provides a general framework for signal conversion and extraction of content (task activation) and style (subject variability) components of general interest for the modeling and analysis of biological signals.  ( 3 min )
    Frequentist Guarantees of Distributed (Non)-Bayesian Inference. (arXiv:2311.08214v1 [math.ST])
    Motivated by the need to analyze large, decentralized datasets, distributed Bayesian inference has become a critical research area across multiple fields, including statistics, electrical engineering, and economics. This paper establishes Frequentist properties, such as posterior consistency, asymptotic normality, and posterior contraction rates, for the distributed (non-)Bayes Inference problem among agents connected via a communication network. Our results show that, under appropriate assumptions on the communication graph, distributed Bayesian inference retains parametric efficiency while enhancing robustness in uncertainty quantification. We also explore the trade-off between statistical efficiency and communication efficiency by examining how the design and size of the communication graph impact the posterior contraction rate. Furthermore, We extend our analysis to time-varying graphs and apply our results to exponential family models, distributed logistic regression, and decentralized detection models.  ( 2 min )
    Ensemble sampling for linear bandits: small ensembles suffice. (arXiv:2311.08376v1 [stat.ML])
    We provide the first useful, rigorous analysis of ensemble sampling for the stochastic linear bandit setting. In particular, we show that, under standard assumptions, for a $d$-dimensional stochastic linear bandit with an interaction horizon $T$, ensemble sampling with an ensemble of size $m$ on the order of $d \log T$ incurs regret bounded by order $(d \log T)^{5/2} \sqrt{T}$. Ours is the first result in any structured setting not to require the size of the ensemble to scale linearly with $T$ -- which defeats the purpose of ensemble sampling -- while obtaining near $\sqrt{T}$ order regret. Ours is also the first result that allows infinite action sets.  ( 2 min )
    Feedforward neural networks as statistical models: Improving interpretability through uncertainty quantification. (arXiv:2311.08139v1 [stat.ME])
    Feedforward neural networks (FNNs) are typically viewed as pure prediction algorithms, and their strong predictive performance has led to their use in many machine-learning applications. However, their flexibility comes with an interpretability trade-off; thus, FNNs have been historically less popular among statisticians. Nevertheless, classical statistical theory, such as significance testing and uncertainty quantification, is still relevant. Supplementing FNNs with methods of statistical inference, and covariate-effect visualisations, can shift the focus away from black-box prediction and make FNNs more akin to traditional statistical models. This can allow for more inferential analysis, and, hence, make FNNs more accessible within the statistical-modelling context.  ( 2 min )
    Large Scale Prediction with Decision Trees. (arXiv:2104.13881v5 [stat.ML] UPDATED)
    This paper shows that decision trees constructed with Classification and Regression Trees (CART) and C4.5 methodology are consistent for regression and classification tasks, even when the number of predictor variables grows sub-exponentially with the sample size, under natural 0-norm and 1-norm sparsity constraints. The theory applies to a wide range of models, including (ordinary or logistic) additive regression models with component functions that are continuous, of bounded variation, or, more generally, Borel measurable. Consistency holds for arbitrary joint distributions of the predictor variables, thereby accommodating continuous, discrete, and/or dependent data. Finally, we show that these qualitative properties of individual trees are inherited by Breiman's random forests. A key step in the analysis is the establishment of an oracle inequality, which allows for a precise characterization of the goodness-of-fit and complexity tradeoff for a mis-specified model.  ( 2 min )
    A Fast and Simple Algorithm for computing the MLE of Amplitude Density Function Parameters. (arXiv:2311.07951v1 [stat.ME])
    Over the last decades, the family of $\alpha$-stale distributions has proven to be useful for modelling in telecommunication systems. Particularly, in the case of radar applications, finding a fast and accurate estimation for the amplitude density function parameters appears to be very important. In this work, the maximum likelihood estimator (MLE) is proposed for parameters of the amplitude distribution. To do this, the amplitude data are \emph{projected} on the horizontal and vertical axes using two simple transformations. It is proved that the \emph{projected} data follow a zero-location symmetric $\alpha$-stale distribution for which the MLE can be computed quite fast. The average of computed MLEs based on two \emph{projections} is considered as estimator for parameters of the amplitude distribution. Performance of the proposed \emph{projection} method is demonstrated through simulation study and analysis of two sets of real radar data.  ( 2 min )

  • Open

    You can predict disease progression by modeling health data in latent space
    Many complex diseases like autoimmune disorders have highly variable progression between patients, making them difficult to understand and predict. A new paper shows that visualizing health data in the latent space helps find hidden patterns in clinical data that can be useful in predicting disease progression. The key finding is they could forecast personalized progression patterns by modeling clinical data in a latent space. This conceptual space uses variables to represent hidden disease factors inferred from measurements. Researchers designed a generative model using variational autoencoders to map connections between raw patient data, expert labels, and these latent variables. When tested on thousands of real patients, the model showed promising ability to: Predict individualized future disease patterns and uncertainty Reveal interpretable trajectories showing progression Cluster patients into phenotypes with unique evolution Align predictions with biological knowledge While further validation is needed, this demonstrates a generalizable framework for gaining new understanding of multifaceted disease evolution, not just for one specific condition. The potential is to enable better monitoring, risk stratification, and treatment personalization for enigmatic diseases using AI to decode their complexity. TLDR: Researchers show AI modeling of clinical data in a tailored latent space could reveal new personalized insights into complex disease progression. Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 1 min )
    Meta to launch speaking AI chatbots played by celebrities
    Meta is launching speaking AI chatbots played by celebrities, such as Kendall Jenner, Snoop Dogg, and Tom Brady. The AI characters are based on famous personalities, influencers, and content creators. The profiles feature AI-generated images created from videos and studio photos of public figures. The AI-generated photos have 'Imagined with AI' watermarks and represent the personalities of the characters. The chat feature is disabled for many users, and the initiative has faced criticism for selling the rights to celebrities' faces and identities. Meta paid $5 million to the highest-profile celebrities for their participation. The celebrity chatbots have generated varying levels of online conversation and followers. Meta's focus on integrating artificial intelligence into their products aligns with their goal of fostering genuine connections. The venture reflects the prevalence of parasocial relationships and the economy of loneliness in today's society. Over 160 companies are developing technology products to address the social problem of loneliness. Mark Zuckerberg emphasized the importance of creating AI responsibly and announced that everyone will be able to create their own AI next year. Source : https://english.elpais.com/technology/2023-11-13/from-paris-hilton-to-snoop-dogg-meta-to-launch-speaking-ai-chatbots-played-by-celebrities.html submitted by /u/NuseAI [link] [comments]  ( 1 min )
    Jaron Lanier Looks into AI's Future | AI IRL
    Really interesting interview with Jaron Lanier submitted by /u/TrickyUniversity5885 [link] [comments]  ( 1 min )
    It’ll probably take 3,4,or 6 years for a new groundbreaking ai framework to change the world from its initial paper. This is considering how long it took for transformer architecture to be seen with potential and finally scaled to chatgpt and generative AI,3 if using the log rule.
    The founding paper for the recent boom in generative AI stems from the paper “attention is all you need” which postulated the transformer architecture in June 2017, it took roughly 1 year for it to start being scaled with the release of GPT:1 in June 2018, after that it took 4 more years for chatgpt to be first released. If we apply the log rule given the new advances in processing and the AI boom it’ll probs be in 3-4 years from the initial release date of the new foundational paper BUT we don’t know if that paper has been released yet or not so until then. submitted by /u/Impossible_Belt_7757 [link] [comments]  ( 1 min )
    Stripe is cracking down on AI companions - Why?
    While Poe from Quora is extempted from Stripe restricted businesses, even though they serve NSFW, bootstrapped startups like GirlfriendGPT get bullied for only allowing SFW conversations with it! submitted by /u/No-Net-4704 [link] [comments]  ( 1 min )
    WebsiteAI
    generated from Website Ai Tools test yours now our TG : Website_AIwebsite : WebsiteAi.io https://preview.redd.it/b880liwrfj0c1.png?width=423&format=png&auto=webp&s=c76092865e3f17645d9e7c936d27334609fb0382 submitted by /u/balok1232 [link] [comments]  ( 1 min )
    Jeremy Howard taught AI to the world and helped invent ChatGPT. He fears he's failed
    submitted by /u/Jariiari7 [link] [comments]  ( 1 min )
    chatGPT might be more useful than AGI
    If AGI is "human level" intelligence, (the v1.0) might be slow, prohibitively expensive and stupid. (AGI tier list) chatGPT costs something like 1c per second, so $60/hr. If you are paying for an artificial intelligence to slowly type, look things up, slowly read, forget things, sleep(?!) (and so on) it might seem a huge step backwards. :::: you can stop reading :::: TLDR: AGI v1 dumb & v expensive / chatGPT great! / chatGPT+more+more = meh / intelligence, hmm. Humany? hmm. / AGI ... crap at first. It's true that a real "general" intelligence would be profoundly amazing. Maybe you don't even need agency. :: This post came out of a joke, where an early-adopter AI enthusiast gets the first access to the first AGI and it slowly replies with "bro, wut" or "i dunno google it". Then goes …  ( 1 min )
    Command GPT - Expert at crafting commands for GPT customization.
    submitted by /u/Senior_tasteey [link] [comments]  ( 1 min )
    Grok is Elon Musk’s new sassy, foul-mouthed AI. But who exactly is it made for?
    submitted by /u/Jariiari7 [link] [comments]  ( 1 min )
    The American public strongly prefers human medical professionals make medical decisions, while at the same time believing they are more likely to make culturally biased decisions than AI.
    submitted by /u/jasonjonesresearch [link] [comments]  ( 1 min )
    Seeking Resources or Courses to Learn about Harnessing Recent Advancements in GPT for Work Applications
    Hello, Reddit community! I'm eager to tap into the potential of the latest advancements in GPT (Generative Pre-trained Transformer) for my work. Specifically, I'm interested in learning how to build a customized GPT model, train it using internal data from my organization, and improve my prompt engineering skills to enhance the model's output. If you have any recommendations or suggestions for resources that cover these topics, please share them below. I'm particularly interested in resources that are up-to-date, covering the latest advancements and techniques in GPT. Thank you in advance for your valuable insights! submitted by /u/flight862 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/14/2023
    Apple’s AI-powered Siri assistant could land as soon as WWDC 2024.[1] Vectara has published an AI hallucination leaderboard that ranks various leading AI chatbots according to their ability to not ‘hallucinate.'[2] An AI developed by Google can accurately predict the weather as many as ten days in advance, outperforming traditional forecasting methods and even top supercomputers.[3] YouTube creators will soon have to disclose use of gen AI in videos or risk suspension.[4] Sources: [1] https://www.techradar.com/phones/apple-plans-to-reinvent-siri-with-on-device-ai-for-the-iphone-16 [2] https://www.tomshardware.com/news/ai-hallucinations-ranked-chatgpt-best [3] https://www.dailymail.co.uk/news/article-12750437/They-didnt-coming-Google-DeepMind-AI-accurately-predict-weather-10-days-advance-leaving-forecasters-supercomputers-cold.html [4] https://blog.youtube/inside-youtube/our-approach-to-responsible-ai-innovation/ submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
  • Open

    [D] Usefulness of vector databases and their real-world applications
    I have been working on a presentation that sums up the main points why vector databases are often an unnecessary optimization that's too often being promoted by vector database vendors. The slides are available here: https://vec3.ai/ Do you think vector databases are overrated? In what instances have vector databases proved most useful in your projects, and were they commercial implementations? submitted by /u/Key_Story_2768 [link] [comments]  ( 9 min )
    [R] Modeling Complex Disease Trajectories using Deep Generative Models with Semi-Supervised Latent Processes
    Many complex diseases like autoimmune disorders have highly variable progression between patients, making them difficult to understand and predict. A new paper shows that visualizing health data in the latent space helps find hidden patterns in clinical data that can be useful in predicting disease progression. The key finding is they could forecast personalized progression patterns by modeling clinical data in a latent space. This conceptual space uses variables to represent hidden disease factors inferred from measurements. Researchers designed a generative model using variational autoencoders to map connections between raw patient data, expert labels, and these latent variables. When tested on thousands of real patients, the model showed promising ability to: Predict individualized future disease patterns and uncertainty Reveal interpretable trajectories showing progression Cluster patients into phenotypes with unique evolution Align predictions with biological knowledge While further validation is needed, this demonstrates a generalizable framework for gaining new understanding of multifaceted disease evolution, not just for one specific condition. The potential is to enable better monitoring, risk stratification, and treatment personalization for enigmatic diseases using AI to decode their complexity. TLDR: Researchers show AI modeling of clinical data in a tailored latent space could reveal new personalized insights into complex disease progression. Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Multiple documents reveal significant limitations of OpenAI's Assistants API for RAG
    The OpenAI RAG system struggled with multiple documents, showing inconsistent performance with our evaluation framework. However, performance improved markedly when all documents were uploaded as a single document. Despite current limitations, such as a 20-file limit per assistant and challenges in handling multiple documents, there is significant potential for improvement. Enhancing the Assistants API to match GPT quality and reducing restrictions could make it a leading RAG solution. https://www.tonic.ai/blog/rag-evaluation-series-validating-openai-assistants-rag-performance submitted by /u/ian4321 [link] [comments]  ( 9 min )
    [P] SAM-GPT-4V for Specialized Object Detection, Segmentation
    Last week, I worked on SAM-GPT-4V, a model that combines Grounding DINO, SAM and GPT-4V. Grounding DINO identifies the car, SAM segments the car, then GPT-4V devises a more specific label (in this case, the car brand). Check out my code: https://github.com/capjamesg/sam-gpt4v https://preview.redd.it/cgwh61csfk0c1.png?width=1210&format=png&auto=webp&s=5403a1cd51635ab057fd85a64081add3a215c338 submitted by /u/zerojames_ [link] [comments]  ( 9 min )
    [D] What ML model to use for this mobility problem?
    Hi guys, I am asked to try and fit a ML model to a huge mobility dataset that i have, and I tried some models but fail to get a decent [performance metric], so i'd love a fresh set of eyes on this! Features of the dataset Each row represents the data for a certain "Origin-Destination" pair, for example pair "31 to 493" meaning this is from place 31 to place 493. The first feature is thus called pair. For each mode of transport (drive, bike, walk, transit) there are 3 "cost"-features namely: [mode]_time, [mode]_cost, [mode]_convenience. So there are 12 features in total (4 modes x 3 costs) Some extra features: average_income, cars_per_household, jobs_at_destination (representing the people travel in this pair 4 observed features, one for each mode. These are the features to predict. This is a value between 0 and 1, representing how much % of people in this pair, use this mode of transport. Additional information sometimes the 3 costs for "transit" are 999, meaning that there is no transit option (train, tram, ...) available for this pair. The usual costs lie between 0 and 100 I deleted the walk_cost feature because every entry was 0. Here are the distributions of all the features: ​ https://preview.redd.it/8lt001so6k0c1.png?width=2002&format=png&auto=webp&s=adfe645606746c008941e36fbb35261d0600a8bb https://preview.redd.it/yeqx4yro6k0c1.png?width=2046&format=png&auto=webp&s=011d05795c203d57d9402805377b0e8741c2378c https://preview.redd.it/7dtk0gso6k0c1.png?width=2042&format=png&auto=webp&s=6ef5ed2d5f9044dfede43a765ae19afb5c487486 And the correlation matrix: ​ https://preview.redd.it/61uwu9fx6k0c1.png?width=1718&format=png&auto=webp&s=d299ca5ffa97dadd45804a62ed2b17628f51d0f9 So the goal is to predict those 4 obsrv features. I am very curious which ML model you would use for this and why? If you have any other suggestions, e.g. pre-processing techniques on the data/features, do share! Thank you guys! submitted by /u/Zijdehoen [link] [comments]  ( 9 min )
    [D] Transformer embedding for ranking questions
    I am working on a project where I have very specific questions about a sports event. These questions are like "Will there a shot on goal or a free-kick first" or "Which team will commit first foul". I want to rank these questions based on a statement like "Liverpool will score today". Just converting them to embedding and calculating cosine similarity doesn't seem to work very well. Is there any way I can improve this? submitted by /u/diedFindingAUsername [link] [comments]  ( 9 min )
    ML course recommendation(Deeplearning AI vs UW ML specialization)[D]
    I know it's been already debated but looking for fresh perspectives. It will be my second course in ML. I've already taken one back in school. Im deciding between deeplearning.ai ML specialization vs UW Washington ML specialization. I've heard deeplearning.ai's now courses have been watered down a bit compared to original Stanford one taught by Andrew Ng. Is there any way to take the Stanford one without paying the Stanford Online fees? Anyway, which one would be the better of the two in my scenario? submitted by /u/ffaangcoder [link] [comments]  ( 9 min )
    [D] Transitioning to a Data Product Ecosystem: Leveraging the Evolutionary Architecture - 4D Architectures, Data-Driven Routing, Feature Toggles, and more!
    📃 𝐈𝐧 𝐭𝐡𝐢𝐬 𝐩𝐢𝐞𝐜𝐞, 𝐰𝐞 𝐞𝐱𝐩𝐥𝐨𝐫𝐞 𝐨𝐧 𝐚 𝐡𝐢𝐠𝐡 𝐥𝐞𝐯𝐞𝐥: ➡️ How data-driven routing helps archive legacy features while seamlessly moving to new functions. ➡️ How a 4D architectural view across time gives a concrete understanding of evolutionary transitions. ➡️ How unit tests on a flexible coding paradigm are more effective than absolute 'best practices'. ➡️ How feature toggles help test new features in production without ruining customer experience. ➡️ Conflict resolution between opposing features, fan-out deployment pipelines, and more. 📩 𝐑𝐞𝐚𝐝 𝐭𝐡𝐞 𝐜𝐨𝐦𝐩𝐥𝐞𝐭𝐞 𝐚𝐫𝐭𝐢𝐜𝐥𝐞 𝐡𝐞𝐫𝐞: https://moderndata101.substack.com/p/transitioning-to-a-data-product-ecosystem What are your thoughts on this? submitted by /u/growth_man [link] [comments]  ( 9 min )
    [D] Validating Claims about Theoretical 131K Token Attention Span in Mistral 7B
    The Mistral 7B paper claims a theoretical attention span of 131K tokens(via propagating information up through layers with GQA) for Mistral 7B. I'm trying to figure out how this is achieved in practice. The trick seems to be on line 128, with the branch `if positions.shape[0] > 1:`, which would typically be taken when the model is first called. From my understanding, taking this branch would compute k/v values for all provided tokens, which could then propagate information for an initial prompt state of up to 131K tokens throughout the model layers(due to the staggered nature of GQA). The line would never be evaluated again as additional tokens are subsequently added to the model cache one at a time as can be seen on line 332 `logits = model.forward(next_token[:, None], torch.LongTensor([cur_pos]).to(next_token)) `. I will note the the cache's default size is 4096, which also seems to give us the sliding attention window the paper refers to. All 32 transformer block layers of Mistral seem to compute attention scores using only the cache whenever only one token is provided to the model.forward() call, otherwise, the attention scores are computed without the cache. Is my current understanding correct? Paper link: https://ar5iv.labs.arxiv.org/html/2310.06825#S2.F1 Code where scores are generated directly from cache: https://github.com/mistralai/mistral-src/blob/main/one_file_ref.py#L125-L140 submitted by /u/TheRealBracketMaster [link] [comments]  ( 9 min )
    [R] With or without a scratchpad, Large Language Models can Strategically Deceive their Users when Put Under Pressure. Results of an autonomous stock trading agent in a realistic, simulated environment.
    Paper: https://arxiv.org/abs/2311.07590 Abstract: We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision. We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment. To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception. submitted by /u/MysteryInc152 [link] [comments]  ( 9 min )
    [D] What is the future for ML researchers and startups?
    With the advent of LLMs, multimodality and “general purpose” AIs which seat on unimaginable amounts money, computing power and data. I’m graduating and want to start a PHD, but feel quite disheartened given the huge results obtained simply by “brute-forcing” and by the ever-growing hype in machine learning that could result in a bubble of data scientists, ML researchers and so on. submitted by /u/francMesina [link] [comments]  ( 9 min )
    MLP-Mixer Interpretability [D][R]
    Hello everyone, I am just looking for materials and papers to interpret the features learnt by MLP-Mixer models. For ConvNets we can plot convolutional kernels and pattern that maximally activate units, so we have at least a vague idea of what the net captures and responds to. For Vision Transformers, linear embedding is analogue to the first conv layer and we can still see things such as color blurs, edge detectors, color filters. For self-supervised pretrained models we sometimes see CLS token at the final layer activate as saliency maps. Probably the patchification in MLP-Mixers is the same as for ViTs, are there any inspections and study available to further analyse this? submitted by /u/reverendCappuccino [link] [comments]  ( 9 min )
    [P] Curated list for data-centric AI tooling / LLM update / contributions welcome
    The tooling space for LLMs and RAG is evolving fast. We updated our “Awesome Open Data Centric AI" list to reflect that: https://github.com/Renumics/awesome-open-data-centric-ai Unsurprisingly, the biggest changes are in the "Security" and the "Observability" category. Also, it is great to see that an increasing amount of tools revolve around qualitative data understanding (instead of quantiative metrics) and data curation. https://preview.redd.it/ejqfdxfdxh0c1.png?width=1200&format=png&auto=webp&s=fb73e360df8e7ee5b6a33004e987dfe665000995 Do you think there is a tool missing? Feel free to create a pull request on Github! ​ ​ submitted by /u/44sps [link] [comments]  ( 9 min )
    [P] Search Result Ranking for Loan Comparison Website
    I just got invited to an interview with a loan comparison website (e.g. NerdWallet or Credible). They are looking for a Product Manager who can help them with an 'AI-powered ranking of comparison results' feature (e.g. Offer A is more suitable for you than offer B). No deep technical knowledge is needed for this position - however, I am interested in that use case and would like to hear your opinion on how I approach it. 'AI-powered ranking of comparison results' - Case Study I do assume that they are/should be using an 'Learning to Rank' algorithm - specifically a listwise approach(intro-to-LTR) as it is specifically designed to solve recommendation / relevance ranking problems. Some data they have on a query-side is, among others, loan amount, loan term and loan protection as well as customer-related data, such as age, employment status, income. On the search results side, they do have information such as financial institution name, eligibility, interest rate, annual percentage rate etc. For labelling, they probably have information on which customer chose (or at least clicked on) which results based on which query. In python, one might follow https://github.com/shiba24/learning2rank's approach and use learning2rank.rank's ListNet library where X is our query-side and customer-related data and y are number of completed transactions, click-through rate etc. The output should (after optimization) rank search results based on their relevancy with regard to the query and the customer-related data (at least that is what I am hoping). I know this is on a very high level but I wanted to break it down just a little bit and try to understand how to approach it on a technical level. What do you think? Am I completely wrong somewhere? Did I miss something? ​ submitted by /u/WuhuSpringfield [link] [comments]  ( 10 min )
    [P] Distil-Whisper: a distilled variant of Whisper that is 6x faster
    Introducing Distil-Whisper: 6x faster than Whisper while performing to within 1% WER on out-of-distribution test data. Through careful data selection and filtering, Whisper's robustness to noise is maintained and hallucinations reduced. For more information, refer to: 👨‍💻 The GitHub repo: https://github.com/huggingface/distil-whisper 📚 The official paper: https://arxiv.org/abs/2311.00430 Here's a quick overview of how it works: 1. Distillation The Whisper encoder performs 1 forward pass, while the decoder performs as many as the number of tokens generated. That means that the decoder accounts for >90% of the total inference time. Therefore, reducing decoder layers is more effective than encoder layers. With this in mind, we keep the whole encoder, but only 2 decoder layers.…  ( 10 min )
    [D] RetNet and Transformers+FlashAttention
    Hello, everyone. I've recently read the RetNet paper and am curious about its comparability with Transformers + FlashAttention 2.0. While their paper suggests that RetNet can be comparable to Transformers + FlashAttention (not the 2.0 version), I'm also curious to know if it can maintain comparability with Transformers at a larger scale. I'm interested in understanding more about this comparison. Thanks for answering 🙏 submitted by /u/Ok-Literature5484 [link] [comments]  ( 9 min )
    [D] Help with extracting citations from text
    Hi everyone. I have a volunteering project and the goal is to extract the citations (not in-text citations, usually the citations at the end) from pdfs. We converted the pdfs into strings of texts. But since the OCR module (pytesseract) did not output clean texts and formats, there are several misspelling words and weird symbols. I have attempted to use regex to extract citations. But I failed because there is not a fixed set of custom rules that I can come up with to take all cases into account. Do you have any experience with this? I tried ChatGPT by pasting a sample text and it gave me such a clean and accurate result of citations. But I want to build a model myself since we are working on our own task which is automating the process of extracting citations from many pdf documents. I have looked a bit into some LLMs that focus on GPT because I think only generative text models can help us. Text classification models or so are less flexible in our case. Should I attempt to pretrain a model? I know LLaMA-2 7gb is good? But not sure if it’s free. OpenAI APIs are out of scope due to our financial constraint. I am relatively new to NLP but I’m confident about my Python programming and statistical techniques. What is your suggestions? Any thought is appreciated. Thank you for your time! submitted by /u/choyakishu [link] [comments]  ( 9 min )
    [R] Source Code Data Augmentation for Deep Learning: A Survey
    Paper: https://arxiv.org/abs/2305.19915 GitHub: https://github.com/terryyz/DataAug4Code Abstract: The increasingly popular adoption of deep learning models in many critical source code tasks motivates the development of data augmentation (DA) techniques to enhance training data and improve various capabilities (e.g., robustness and generalizability) of these models. Although a series of DA methods have been proposed and tailored for source code models, there lacks a comprehensive survey and examination to understand their effectiveness and implications. This paper fills this gap by conducting a comprehensive and integrative survey of data augmentation for source code, wherein we systematically compile and encapsulate existing literature to provide a comprehensive overview of the field. We start with an introduction of data augmentation in source code and then provide a discussion on major representative approaches. Next, we highlight the general strategies and techniques to optimize the DA quality. Subsequently, we underscore techniques useful in real-world source code scenarios and downstream tasks. Finally, we outline the prevailing challenges and potential opportunities for future research. In essence, we aim to demystify the corpus of existing literature on source code DA for deep learning, and foster further exploration in this sphere. Complementing this, we present a continually updated GitHub repository that hosts a list of update-to-date papers on DA for source code modeling, accessible at \url{this https URL}. submitted by /u/International-Rip958 [link] [comments]  ( 9 min )
    [D] Table Data Labeling
    I've got around 50k pages of proprietary table data I'm hoping to fine-tune Microsoft's table-tranformer on. Has anyone found a good solution to annotating table data?The solutions I've found thus far require me to manually label every cell. It'd be great if there was a labeling tool that autolabels and lets me make corrections as needed (or at least lets me label entire rows/columns at once) submitted by /u/Reddit-1-account [link] [comments]  ( 9 min )
    [R] GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
    Enabling AI to navigate and interact with smartphone UIs is hard, requiring a model that goes beyond mere text processing to handle intricate visual and interactive tasks. A new paper proposes MM-Navigator, an agent based on GPT-4V that can use an iPhone and make purchases on the Amazon app. The agent can "understand" and interact with smartphone interfaces in a much more human-like manner than previous attempts. The key innovation lies in GPT-4V's ability to process both text and image inputs. The agent takes a user's text instructions and the current screen image, then outputs a description of the next action, including precise screen locations. The researchers improved interaction accuracy by adding numeric tags to interactive elements on the screen, which GPT-4V references to indicat…  ( 10 min )
  • Open

    Question as a Layman - RNN versus FNN in ChatGPT
    Hi, I am having a discussion with my friend and we are talking about the distinction between recurrent neural networks and feed forward neural networks and the topic of ChatGPT came up and consciousness and all of that jazz. My question is basically this: ChatGPT is typically characterized as a feedforward neural network as far as I know because it lacks "internal feedback loops" which are present in recurrent neural networks. I am wondering if the training of ChatGPT can be considered a form of internal feedback? It doesn't have internal feedback loops within its primary structure, but it could still have internal feedback loops more Macroscopically, right? And if that is the case, isn't defining ChatGPT as recurrent or feedforward sort of a matter of how high up in the system's architecture you are viewing it? And then if that is the case, and we agree that we view a system from its entire architecture, even temporally, then we would need to consider Chat GPT as a recurrent neural network - albeit on with significantly less recurrence than if it had recurrent modules inside each node or "neuron", but still recurrent no doubt. submitted by /u/platonic2257 [link] [comments]  ( 1 min )
    Experimental brain-like computing system more accurate with custom algorithm
    submitted by /u/keghn [link] [comments]  ( 1 min )
  • Open

    Book Recommendation
    Since early 21st century, artificial intelligence (AI) has been reshaping almost all areas of human society, which has high potential to spark the fourth industrial revolution. Notable examples can be found in the sector of road transportation, where AI has drastically changed automobile design and traffic management. Many new technologies, such as driver assistance, autonomous driving, and cloud-based cooperation, are emerging at an unbelievable speed. These new technologies have the potential to significantly improve driving ability, reduce traffic accidents, and relieve urban congestion. As one of the most important AI branches, reinforcement learning (RL) has attracted increasing attention in recent years. RL is an interdisciplinary field of trial-and-error learning and optimal contro…  ( 1 min )
    Announcing the First Annual RL Conference!
    About RLC The first Reinforcement Learning Conference (RLC) will be held in Amherst, Massachusetts from August 9–12, 2024. RLC is an annual international conference focusing on reinforcement learning. RLC provides an archival venue where reinforcement learning researchers can interact and share their research in a more focused setting than typical large machine learning venues. The RLC peer review process prioritizes rigorous methodology over perceived importance, aiming to foster scholarly discussions on both well-established and emerging topics in RL. https://rl-conference.cc/ Submission deadline: March 1 2024 Advisory Committee Peter Stone (UT Austin) Satinder Singh (University of Michigan) Emma Brunskill (Stanford) Michael Littman (Brown) Shie Mannor (Technion, NVIDIA) Michael Bowling (U Alberta) Sergey Levine (UC Berkeley) Balaraman Ravindran (IIT Madras) Sham Kakade (Harvard) Benjamin Rosman (University of the Witwatersrand) Marc Deisenroth (UCL) Andrew Barto (UMass Amherst) submitted by /u/ShrekLovesYouBack [link] [comments]  ( 1 min )
    How to create an expert for Imitation Learning ?
    Hi, So I'm using the poses that are captured from a pose estimator (mediapipe) and want to use this to train my humanoid model. I'm planning on using imitation learning for this and I'm not sure how to create the expert in this case. Can someone please enlighten me how to do this?? ​ A little about the project: I plan on using this to train a humanoid to walk. hence plan on mapping this to an expert and than train the humanoid to walk based on how the expert walk. I have seen people teach a humanoid to walk using PPO or some other RL and then use that as the expert and train the other using imitation learning where the PPO trained humanoid acts as the expert. submitted by /u/rakk109 [link] [comments]  ( 1 min )
  • Open

    Implement a custom AutoML job using pre-selected algorithms in Amazon SageMaker Automatic Model Tuning
    AutoML allows you to derive rapid, general insights from your data right at the beginning of a machine learning (ML) project lifecycle. Understanding up front which preprocessing techniques and algorithm types provide best results reduces the time to develop, train, and deploy the right model. It plays a crucial role in every model’s development process […]  ( 14 min )
    Best prompting practices for using the Llama 2 Chat LLM through Amazon SageMaker JumpStart
    Llama 2 stands at the forefront of AI innovation, embodying an advanced auto-regressive language model developed on a sophisticated transformer foundation. It’s tailored to address a multitude of applications in both the commercial and research domains with English as the primary linguistic concentration. Its model parameters scale from an impressive 7 billion to a remarkable […]  ( 18 min )
    Principal Financial Group uses AWS Post Call Analytics solution to extract omnichannel customer insights
    An established financial services firm with over 140 years in business, Principal is a global investment management leader and serves more than 62 million customers around the world. Principal is conducting enterprise-scale near-real-time analytics to deliver a seamless and hyper-personalized omnichannel customer experience on their mission to make financial security accessible for all. They are […]  ( 10 min )
    Foundational vision models and visual prompt engineering for autonomous driving applications
    Prompt engineering has become an essential skill for anyone working with large language models (LLMs) to generate high-quality and relevant texts. Although text prompt engineering has been widely discussed, visual prompt engineering is an emerging field that requires attention. Visual prompts can include bounding boxes or masks that guide vision models in generating relevant and […]  ( 13 min )
  • Open

    Ringing in the Future: NVIDIA and Amdocs Bring Custom Generative AI to Global Telco Industry
    The telecommunications industry — the backbone of today’s interconnected world — is valued at a staggering $1.7 trillion globally, according to IDC. It’s a massive operation, as telcos process hundreds of petabytes of data in their networks each day. That magnitude is only increasing, as the total amount of data transacted globally is forecast to Read article >  ( 6 min )
    In the Fast Lane: NVIDIA Announces Omniverse Cloud Services on Microsoft Azure to Accelerate Automotive Digitalization
    Automotive companies are transforming every phase of their product lifecycle — evolving their primarily physical, manual processes into software-driven, AI-enhanced digital systems. To help them save costs and reduce lead times, NVIDIA is announcing two new simulation engines on Omniverse Cloud: the virtual factory simulation engine and the autonomous vehicle (AV) simulation engine. Omniverse Cloud, Read article >  ( 6 min )
    New NVIDIA H100, H200 Tensor Core GPU Instances Coming to Microsoft Azure to Accelerate AI Workloads
    As NVIDIA continues to collaborate with Microsoft to build state-of-the-art AI infrastructure, Microsoft is introducing additional H100-based virtual machines to Microsoft Azure to accelerate demanding AI workloads. At its Ignite conference in Seattle today, Microsoft announced its new NC H100 v5 VM series for Azure, the industry’s first cloud instances featuring NVIDIA H100 NVL GPUs. Read article >  ( 5 min )
    NVIDIA Fast-Tracks Custom Generative AI Model Development for Enterprises
    Today’s landscape of free, open-source large language models (LLMs) is like an all-you-can-eat buffet for enterprises. This abundance can be overwhelming for developers building custom generative AI applications, as they need to navigate unique project and business requirements, including compatibility, security and the data used to train the models. NVIDIA AI Foundation Models — a Read article >  ( 5 min )
    What Is Retrieval-Augmented Generation?
    To understand the latest advance in generative AI, imagine a courtroom. Judges hear and decide cases based on their general understanding of the law. Sometimes a case — like a malpractice suit or a labor dispute —  requires special expertise, so judges send court clerks to a law library, looking for precedents and specific cases Read article >  ( 9 min )
    Igniting the Future: TensorRT-LLM Release Accelerates AI Inference Performance, Adds Support for New Models Running on RTX-Powered Windows 11 PCs
    Artificial intelligence on Windows 11 PCs marks a pivotal moment in tech history, revolutionizing experiences for gamers, creators, streamers, office workers, students and even casual PC users. It offers unprecedented opportunities to enhance productivity for users of the more than 100 million Windows PCs and workstations that are powered by RTX GPUs. And NVIDIA RTX Read article >  ( 7 min )
  • Open

    This 3D printer can watch itself fabricate objects
    Computer vision enables contact-free 3D printing, letting engineers print with high-performance materials they couldn’t use before.  ( 11 min )
  • Open

    De-SaTE: Denoising Self-attention Transformer Encoders for Li-ion Battery Health Prognostics. (arXiv:2310.00023v2 [cs.LG] UPDATED)
    The usage of Lithium-ion (Li-ion) batteries has gained widespread popularity across various industries, from powering portable electronic devices to propelling electric vehicles and supporting energy storage systems. A central challenge in Li-ion battery reliability lies in accurately predicting their Remaining Useful Life (RUL), which is a critical measure for proactive maintenance and predictive analytics. This study presents a novel approach that harnesses the power of multiple denoising modules, each trained to address specific types of noise commonly encountered in battery data. Specifically, a denoising auto-encoder and a wavelet denoiser are used to generate encoded/decomposed representations, which are subsequently processed through dedicated self-attention transformer encoders. After extensive experimentation on NASA and CALCE data, a broad spectrum of health indicator values are estimated under a set of diverse noise patterns. The reported error metrics on these data are on par with or better than the state-of-the-art reported in recent literature.  ( 2 min )
    Learning to Generate Better Than Your LLM. (arXiv:2306.11816v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large Language Models (LLMs) for text generation. In particular, recent LLMs such as ChatGPT and GPT-4 can engage in fluent conversations with users after finetuning with RL. Capitalizing on key properties of text generation, we seek to investigate RL algorithms beyond general purpose algorithms like Proximal Policy Optimization (PPO). In particular, we extend RL algorithms to allow them to interact with a dynamic black-box guide LLM and propose RL with guided feedback (RLGF), a suite of RL algorithms for LLM fine-tuning. We provide two ways for the guide LLM to interact with the LLM to be optimized for maximizing rewards. The guide LLM can generate text which serves as additional starting states for the RL optimization procedure. The guide LLM can also be used to complete the partial sentences generated by the LLM that is being optimized, treating the guide LLM as an expert to imitate and surpass eventually. We experiment on the IMDB positive sentiment, CommonGen, and TL;DR summarization tasks. We show that our RL algorithms achieve higher performance than supervised learning (SL) and the RL baseline PPO, demonstrating the benefit of interaction with the guide LLM. On both CommonGen and TL;DR, we not only outperform our SL baselines but also improve upon PPO across a variety of metrics beyond the one we optimized for. Our code can be found at https://github.com/Cornell-RL/tril.  ( 3 min )
    Unleashing Realistic Air Quality Forecasting: Introducing the Ready-to-Use PurpleAirSF Dataset. (arXiv:2306.13948v2 [cs.LG] UPDATED)
    Air quality forecasting has garnered significant attention recently, with data-driven models taking center stage due to advancements in machine learning and deep learning models. However, researchers face challenges with complex data acquisition and the lack of open-sourced datasets, hindering efficient model validation. This paper introduces PurpleAirSF, a comprehensive and easily accessible dataset collected from the PurpleAir network. With its high temporal resolution, various air quality measures, and diverse geographical coverage, this dataset serves as a useful tool for researchers aiming to develop novel forecasting models, study air pollution patterns, and investigate their impacts on health and the environment. We present a detailed account of the data collection and processing methods employed to build PurpleAirSF. Furthermore, we conduct preliminary experiments using both classic and modern spatio-temporal forecasting models, thereby establishing a benchmark for future air quality forecasting tasks.  ( 2 min )
    A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset. (arXiv:2307.10455v3 [cs.CV] UPDATED)
    In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is taxonomically classified by an expert, and also has associated genetic information including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically-based proxies for species classification. This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment, however, the dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community. Driven by the biological nature inherent to the dataset, a characteristic long-tailed class-imbalance distribution is exhibited. Furthermore, taxonomic labelling is a hierarchical classification scheme, presenting a highly fine-grained classification problem at lower levels. Beyond spurring interest in biodiversity research within the machine learning community, progress on creating an image-based taxonomic classifier will also further the ultimate goal of all BIOSCAN research: to lay the foundation for a comprehensive survey of global biodiversity. This paper introduces the dataset and explores the classification task through the implementation and analysis of a baseline classifier.  ( 3 min )
    Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms. (arXiv:2310.10418v2 [cs.LG] UPDATED)
    Commonsense norms are defeasible by context: reading books is usually great, but not when driving a car. While contexts can be explicitly described in language, in embodied scenarios, contexts are often provided visually. This type of visually grounded reasoning about defeasible commonsense norms is generally easy for humans, but (as we show) poses a challenge for machines, as it necessitates both visual understanding and reasoning about commonsense norms. We construct a new multimodal benchmark for studying visual-grounded commonsense norms: NORMLENS. NORMLENS consists of 10K human judgments accompanied by free-form explanations covering 2K multimodal situations, and serves as a probe to address two questions: (1) to what extent can models align with average human judgment? and (2) how well can models explain their predicted judgments? We find that state-of-the-art model judgments and explanations are not well-aligned with human annotation. Additionally, we present a new approach to better align models with humans by distilling social commonsense knowledge from large language models. The data and code are released at https://seungjuhan.me/normlens.  ( 3 min )
    Understanding Generalization via Set Theory. (arXiv:2311.06545v1 [cs.LG])
    Generalization is at the core of machine learning models. However, the definition of generalization is not entirely clear. We employ set theory to introduce the concepts of algorithms, hypotheses, and dataset generalization. We analyze the properties of dataset generalization and prove a theorem on surrogate generalization procedures. This theorem leads to our generalization method. Through a generalization experiment on the MNIST dataset, we obtain 13,541 sample bases. When we use the entire training set to evaluate the model's performance, the models achieve an accuracy of 99.945%. However, if we shift the sample bases or modify the neural network structure, the performance experiences a significant decline. We also identify consistently mispredicted samples and find that they are all challenging examples. The experiments substantiated the accuracy of the generalization definition and the effectiveness of the proposed methods. Both the set-theoretic deduction and the experiments help us better understand generalization.  ( 2 min )
    Quantitatively Measuring and Contrastively Exploring Heterogeneity for Domain Generalization. (arXiv:2305.15889v3 [cs.LG] UPDATED)
    Domain generalization (DG) is a prevalent problem in real-world applications, which aims to train well-generalized models for unseen target domains by utilizing several source domains. Since domain labels, i.e., which domain each data point is sampled from, naturally exist, most DG algorithms treat them as a kind of supervision information to improve the generalization performance. However, the original domain labels may not be the optimal supervision signal due to the lack of domain heterogeneity, i.e., the diversity among domains. For example, a sample in one domain may be closer to another domain, its original label thus can be the noise to disturb the generalization learning. Although some methods try to solve it by re-dividing domains and applying the newly generated dividing pattern, the pattern they choose may not be the most heterogeneous due to the lack of the metric for heterogeneity. In this paper, we point out that domain heterogeneity mainly lies in variant features under the invariant learning framework. With contrastive learning, we propose a learning potential-guided metric for domain heterogeneity by promoting learning variant features. Then we notice the differences between seeking variance-based heterogeneity and training invariance-based generalizable model. We thus propose a novel method called Heterogeneity-based Two-stage Contrastive Learning (HTCL) for the DG task. In the first stage, we generate the most heterogeneous dividing pattern with our contrastive metric. In the second stage, we employ an invariance-aimed contrastive learning by re-building pairs with the stable relation hinted by domains and classes, which better utilizes generated domain labels for generalization learning. Extensive experiments show HTCL better digs heterogeneity and yields great generalization performance.  ( 3 min )
    FMG-Net and W-Net: Multigrid Inspired Deep Learning Architectures For Medical Imaging Segmentation. (arXiv:2304.02725v3 [eess.IV] UPDATED)
    Accurate medical imaging segmentation is critical for precise and effective medical interventions. However, despite the success of convolutional neural networks (CNNs) in medical image segmentation, they still face challenges in handling fine-scale features and variations in image scales. These challenges are particularly evident in complex and challenging segmentation tasks, such as the BraTS multi-label brain tumor segmentation challenge. In this task, accurately segmenting the various tumor sub-components, which vary significantly in size and shape, remains a significant challenge, with even state-of-the-art methods producing substantial errors. Therefore, we propose two architectures, FMG-Net and W-Net, that incorporate the principles of geometric multigrid methods for solving linear systems of equations into CNNs to address these challenges. Our experiments on the BraTS 2020 dataset demonstrate that both FMG-Net and W-Net outperform the widely used U-Net architecture regarding tumor subcomponent segmentation accuracy and training efficiency. These findings highlight the potential of incorporating the principles of multigrid methods into CNNs to improve the accuracy and efficiency of medical imaging segmentation.  ( 2 min )
    Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models. (arXiv:2310.03059v2 [cs.CV] UPDATED)
    The popularity of pre-trained large models has revolutionized downstream tasks across diverse fields, such as language, vision, and multi-modality. To minimize the adaption cost for downstream tasks, many Parameter-Efficient Fine-Tuning (PEFT) techniques are proposed for language and 2D image pre-trained models. However, the specialized PEFT method for 3D pre-trained models is still under-explored. To this end, we introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters. Specifically, for a pre-trained 3D model, we freeze most of its parameters, and only tune the newly added PEFT modules on downstream tasks, which consist of a Point-prior Prompt and a Geometry-aware Adapter. The Point-prior Prompt adopts a set of learnable prompt tokens, for which we propose to construct a memory bank with domain-specific knowledge, and utilize a parameter-free attention to enhance the prompt tokens. The Geometry-aware Adapter aims to aggregate point cloud features within spatial neighborhoods to capture fine-grained geometric information through local interactions. Extensive experiments indicate that our Point-PEFT can achieve better performance than the full fine-tuning on various downstream tasks, while using only 5% of the trainable parameters, demonstrating the efficiency and effectiveness of our approach. Code will be released at https://github.com/Even-JK/PEFT-3D.  ( 2 min )
    A Survey of AI Text-to-Image and AI Text-to-Video Generators. (arXiv:2311.06329v1 [cs.CV])
    Text-to-Image and Text-to-Video AI generation models are revolutionary technologies that use deep learning and natural language processing (NLP) techniques to create images and videos from textual descriptions. This paper investigates cutting-edge approaches in the discipline of Text-to-Image and Text-to-Video AI generations. The survey provides an overview of the existing literature as well as an analysis of the approaches used in various studies. It covers data preprocessing techniques, neural network types, and evaluation metrics used in the field. In addition, the paper discusses the challenges and limitations of Text-to-Image and Text-to-Video AI generations, as well as future research directions. Overall, these models have promising potential for a wide range of applications such as video production, content creation, and digital marketing.  ( 2 min )
    Fast Machine Learning Method with Vector Embedding on Orthonormal Basis and Spectral Transform. (arXiv:2310.18424v2 [cs.LG] UPDATED)
    This paper presents a novel fast machine learning method that leverages two techniques: Vector Embedding on Orthonormal Basis (VEOB) and Spectral Transform (ST). The VEOB converts the original data encoding into a vector embedding with coordinates projected onto orthonormal bases. The Singular Value Decomposition (SVD) technique is used to calculate the vector basis and projection coordinates, leading to an enhanced distance measurement in the embedding space and facilitating data compression by preserving the projection vectors associated with the largest singular values. On the other hand, ST transforms sequence of vector data into spectral space. By applying the Discrete Cosine Transform (DCT) and selecting the most significant components, it streamlines the handling of lengthy vector sequences. The paper provides examples of word embedding, text chunk embedding, and image embedding, implemented in Julia language with a vector database. It also investigates unsupervised learning and supervised learning using this method, along with strategies for handling large data volumes.  ( 2 min )
    Time Travel in LLMs: Tracing Data Contamination in Large Language Models. (arXiv:2308.08493v2 [cs.CL] CROSS LISTED)
    Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in measuring LLMs' real effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination at the instance level; using this information, our approach then assesses wider contamination at the partition level. To estimate contamination of individual instances, we employ "guided instruction:" a prompt consisting of the dataset name, partition type, and the random-length initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or nearly matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE-L or BLEURT) is statistically significantly better with the completions from guided instruction compared to a "general instruction" that does not include the dataset and partition name. The second idea marks a dataset partition as contaminated if a classifier based on GPT-4 with few-shot in-context learning prompt marks multiple generated completions as exact/near-exact matches of the corresponding reference instances. Our best method achieves an accuracy between 92% and 100% in detecting if an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human experts. Further, our findings indicate that GPT-4 is contaminated with AG News, WNLI, and XSum datasets.  ( 3 min )
    A Robust Deep Learning Method with Uncertainty Estimation for the Pathological Classification of Renal Cell Carcinoma based on CT Images. (arXiv:2311.00567v2 [eess.IV] UPDATED)
    Objectives To develop and validate a deep learning-based diagnostic model incorporating uncertainty estimation so as to facilitate radiologists in the preoperative differentiation of the pathological subtypes of renal cell carcinoma (RCC) based on CT images. Methods Data from 668 consecutive patients, pathologically proven RCC, were retrospectively collected from Center 1. By using five-fold cross-validation, a deep learning model incorporating uncertainty estimation was developed to classify RCC subtypes into clear cell RCC (ccRCC), papillary RCC (pRCC), and chromophobe RCC (chRCC). An external validation set of 78 patients from Center 2 further evaluated the model's performance. Results In the five-fold cross-validation, the model's area under the receiver operating characteristic curve (AUC) for the classification of ccRCC, pRCC, and chRCC was 0.868 (95% CI: 0.826-0.923), 0.846 (95% CI: 0.812-0.886), and 0.839 (95% CI: 0.802-0.88), respectively. In the external validation set, the AUCs were 0.856 (95% CI: 0.838-0.882), 0.787 (95% CI: 0.757-0.818), and 0.793 (95% CI: 0.758-0.831) for ccRCC, pRCC, and chRCC, respectively. Conclusions The developed deep learning model demonstrated robust performance in predicting the pathological subtypes of RCC, while the incorporated uncertainty emphasized the importance of understanding model confidence, which is crucial for assisting clinical decision-making for patients with renal tumors. Clinical relevance statement Our deep learning approach, integrated with uncertainty estimation, offers clinicians a dual advantage: accurate RCC subtype predictions complemented by diagnostic confidence references, promoting informed decision-making for patients with RCC.  ( 3 min )
    Addressing Data Scarcity in Optical Matrix Multiplier Modeling Using Transfer Learning. (arXiv:2308.11630v2 [cs.LG] UPDATED)
    We present and experimentally evaluate using transfer learning to address experimental data scarcity when training neural network (NN) models for Mach-Zehnder interferometer mesh-based optical matrix multipliers. Our approach involves pre-training the model using synthetic data generated from a less accurate analytical model and fine-tuning with experimental data. Our investigation demonstrates that this method yields significant reductions in modeling errors compared to using an analytical model, or a standalone NN model when training data is limited. Utilizing regularization techniques and ensemble averaging, we achieve < 1 dB root-mean-square error on the matrix weights implemented by a 3x3 photonic chip while using only 25% of the available data.  ( 2 min )
    RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation. (arXiv:2311.01455v2 [cs.RO] UPDATED)
    We present RoboGen, a generative robotic agent that automatically learns diverse robotic skills at scale via generative simulation. RoboGen leverages the latest advancements in foundation and generative models. Instead of directly using or adapting these models to produce policies or low-level actions, we advocate for a generative scheme, which uses these models to automatically generate diversified tasks, scenes, and training supervisions, thereby scaling up robotic skill learning with minimal human supervision. Our approach equips a robotic agent with a self-guided propose-generate-learn cycle: the agent first proposes interesting tasks and skills to develop, and then generates corresponding simulation environments by populating pertinent objects and assets with proper spatial configurations. Afterwards, the agent decomposes the proposed high-level task into sub-tasks, selects the optimal learning approach (reinforcement learning, motion planning, or trajectory optimization), generates required training supervision, and then learns policies to acquire the proposed skill. Our work attempts to extract the extensive and versatile knowledge embedded in large-scale models and transfer them to the field of robotics. Our fully generative pipeline can be queried repeatedly, producing an endless stream of skill demonstrations associated with diverse tasks and environments.  ( 2 min )
    CeBed: A Benchmark for Deep Data-Driven OFDM Channel Estimation. (arXiv:2306.13761v2 [cs.AI] UPDATED)
    Deep learning has been extensively used in wireless communication problems, including channel estimation. Although several data-driven approaches exist, a fair and realistic comparison between them is difficult due to inconsistencies in the experimental conditions and the lack of a standardized experimental design. In addition, the performance of data-driven approaches is often compared based on empirical analysis. The lack of reproducibility and availability of standardized evaluation tools (e.g., datasets, codebases) hinder the development and progress of data-driven methods for channel estimation and wireless communication in general. In this work, we introduce an initiative to build benchmarks that unify several data-driven OFDM channel estimation approaches. Specifically, we present CeBed (a testbed for channel estimation) including different datasets covering various systems models and propagation conditions along with the implementation of ten deep and traditional baselines. This benchmark considers different practical aspects such as the robustness of the data-driven models, the number and the arrangement of pilots, and the number of receive antennas. This work offers a comprehensive and unified framework to help researchers evaluate and design data-driven channel estimation algorithms.  ( 2 min )
    Price-Aware Deep Learning for Electricity Markets. (arXiv:2308.01436v2 [cs.LG] UPDATED)
    While deep learning gradually penetrates operational planning, its inherent prediction errors may significantly affect electricity prices. This letter examines how prediction errors propagate into electricity prices, revealing notable pricing errors and their spatial disparity in congested power systems. To improve fairness, we propose to embed electricity market-clearing optimization as a deep learning layer. Differentiating through this layer allows for balancing between prediction and pricing errors, as oppose to minimizing prediction errors alone. This layer implicitly optimizes fairness and controls the spatial distribution of price errors across the system. We showcase the price-aware deep learning in the nexus of wind power forecasting and short-term electricity market clearing.  ( 2 min )
    Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach. (arXiv:2305.18427v3 [cs.LG] UPDATED)
    A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed. Reward redistribution serves as a solution to re-assign credits for each time step from observed sequences. While the majority of current approaches construct the reward redistribution in an uninterpretable manner, we propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution and preserving policy invariance. In this paper, we start by studying the role of causal generative models in reward redistribution by characterizing the generation of Markovian rewards and trajectory-wise long-term return and further propose a framework, called Generative Return Decomposition (GRD), for policy optimization in delayed reward scenarios. Specifically, GRD first identifies the unobservable Markovian rewards and causal relations in the generative process. Then, GRD makes use of the identified causal generative model to form a compact representation to train policy over the most favorable subspace of the state space of the agent. Theoretically, we show that the unobservable Markovian reward function is identifiable, as well as the underlying causal structure and causal models. Experimental results show that our method outperforms state-of-the-art methods and the provided visualization further demonstrates the interpretability of our method. The project page is located at https://reedzyd.github.io/GenerativeReturnDecomposition/.
    SneakyPrompt: Jailbreaking Text-to-image Generative Models. (arXiv:2305.12082v3 [cs.LG] UPDATED)
    Text-to-image generative models such as Stable Diffusion and DALL$\cdot$E raise many ethical concerns due to the generation of harmful images such as Not-Safe-for-Work (NSFW) ones. To address these ethical concerns, safety filters are often adopted to prevent the generation of NSFW images. In this work, we propose SneakyPrompt, the first automated attack framework, to jailbreak text-to-image generative models such that they generate NSFW images even if safety filters are adopted. Given a prompt that is blocked by a safety filter, SneakyPrompt repeatedly queries the text-to-image generative model and strategically perturbs tokens in the prompt based on the query results to bypass the safety filter. Specifically, SneakyPrompt utilizes reinforcement learning to guide the perturbation of tokens. Our evaluation shows that SneakyPrompt successfully jailbreaks DALL$\cdot$E 2 with closed-box safety filters to generate NSFW images. Moreover, we also deploy several state-of-the-art, open-source safety filters on a Stable Diffusion model. Our evaluation shows that SneakyPrompt not only successfully generates NSFW images, but also outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models, in terms of both the number of queries and qualities of the generated NSFW images. SneakyPrompt is open-source and available at this repository: \url{https://github.com/Yuchen413/text2image_safety}.
    SLiMe: Segment Like Me. (arXiv:2309.03179v3 [cs.CV] UPDATED)
    Significant strides have been made using large vision-language models, like Stable Diffusion (SD), for a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advancements, we explore leveraging these extensive vision-language models for segmenting images at any desired granularity using as few as one annotated sample by proposing SLiMe. SLiMe frames this problem as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map" from the SD prior. Then, using the extracted attention maps, the text embeddings of Stable Diffusion are optimized such that, each of them, learn about a single segmented region from the training image. These learned embeddings then highlight the segmented region in the attention maps, which in turn can then be used to derive the segmentation map. This enables SLiMe to segment any real-world image during inference with the granularity of the segmented region in the training image, using just one example. Moreover, leveraging additional training data when available, i.e. few-shot, improves the performance of SLiMe. We carried out a knowledge-rich set of experiments examining various design factors and showed that SLiMe outperforms other existing one-shot and few-shot segmentation methods.
    Game Theory Solutions in Sensor-Based Human Activity Recognition: A Review. (arXiv:2311.06311v1 [cs.GT])
    The Human Activity Recognition (HAR) tasks automatically identify human activities using the sensor data, which has numerous applications in healthcare, sports, security, and human-computer interaction. Despite significant advances in HAR, critical challenges still exist. Game theory has emerged as a promising solution to address these challenges in machine learning problems including HAR. However, there is a lack of research work on applying game theory solutions to the HAR problems. This review paper explores the potential of game theory as a solution for HAR tasks, and bridges the gap between game theory and HAR research work by suggesting novel game-theoretic approaches for HAR problems. The contributions of this work include exploring how game theory can improve the accuracy and robustness of HAR models, investigating how game-theoretic concepts can optimize recognition algorithms, and discussing the game-theoretic approaches against the existing HAR methods. The objective is to provide insights into the potential of game theory as a solution for sensor-based HAR, and contribute to develop a more accurate and efficient recognition system in the future research directions.
    Stain Consistency Learning: Handling Stain Variation for Automatic Digital Pathology Segmentation. (arXiv:2311.06552v1 [eess.IV])
    Stain variation is a unique challenge associated with automated analysis of digital pathology. Numerous methods have been developed to improve the robustness of machine learning methods to stain variation, but comparative studies have demonstrated limited benefits to performance. Moreover, methods to handle stain variation were largely developed for H&E stained data, with evaluation generally limited to classification tasks. Here we propose Stain Consistency Learning, a novel framework combining stain-specific augmentation with a stain consistency loss function to learn stain colour invariant features. We perform the first, extensive comparison of methods to handle stain variation for segmentation tasks, comparing ten methods on Masson's trichrome and H&E stained cell and nuclei datasets, respectively. We observed that stain normalisation methods resulted in equivalent or worse performance, while stain augmentation or stain adversarial methods demonstrated improved performance, with the best performance consistently achieved by our proposed approach. The code is available at: https://github.com/mlyg/stain_consistency_learning
    Higher-Order Newton Methods with Polynomial Work per Iteration. (arXiv:2311.06374v1 [math.OC])
    We present generalizations of Newton's method that incorporate derivatives of an arbitrary order $d$ but maintain a polynomial dependence on dimension in their cost per iteration. At each step, our $d^{\text{th}}$-order method uses semidefinite programming to construct and minimize a sum of squares-convex approximation to the $d^{\text{th}}$-order Taylor expansion of the function we wish to minimize. We prove that our $d^{\text{th}}$-order method has local convergence of order $d$. This results in lower oracle complexity compared to the classical Newton method. We show on numerical examples that basins of attraction around local minima can get larger as $d$ increases. Under additional assumptions, we present a modified algorithm, again with polynomial cost per iteration, which is globally convergent and has local convergence of order $d$.
    Advancing Parsimonious Deep Learning Weather Prediction using the HEALPix Mesh. (arXiv:2311.06253v1 [physics.ao-ph])
    We present a parsimonious deep learning weather prediction model on the Hierarchical Equal Area isoLatitude Pixelization (HEALPix) to forecast seven atmospheric variables for arbitrarily long lead times on a global approximately 110 km mesh at 3h time resolution. In comparison to state-of-the-art machine learning weather forecast models, such as Pangu-Weather and GraphCast, our DLWP-HPX model uses coarser resolution and far fewer prognostic variables. Yet, at one-week lead times its skill is only about one day behind the state-of-the-art numerical weather prediction model from the European Centre for Medium-Range Weather Forecasts. We report successive forecast improvements resulting from model design and data-related decisions, such as switching from the cubed sphere to the HEALPix mesh, inverting the channel depth of the U-Net, and introducing gated recurrent units (GRU) on each level of the U-Net hierarchy. The consistent east-west orientation of all cells on the HEALPix mesh facilitates the development of location-invariant convolution kernels that are successfully applied to propagate global weather patterns across our planet. Without any loss of spectral power after two days, the model can be unrolled autoregressively for hundreds of steps into the future to generate stable and realistic states of the atmosphere that respect seasonal trends, as showcased in one-year simulations. Our parsimonious DLWP-HPX model is research-friendly and potentially well-suited for sub-seasonal and seasonal forecasting.
    Flatness-aware Adversarial Attack. (arXiv:2311.06423v1 [cs.LG])
    The transferability of adversarial examples can be exploited to launch black-box attacks. However, adversarial examples often present poor transferability. To alleviate this issue, by observing that the diversity of inputs can boost transferability, input regularization based methods are proposed, which craft adversarial examples by combining several transformed inputs. We reveal that input regularization based methods make resultant adversarial examples biased towards flat extreme regions. Inspired by this, we propose an attack called flatness-aware adversarial attack (FAA) which explicitly adds a flatness-aware regularization term in the optimization target to promote the resultant adversarial examples towards flat extreme regions. The flatness-aware regularization term involves gradients of samples around the resultant adversarial examples but optimizing gradients requires the evaluation of Hessian matrix in high-dimension spaces which generally is intractable. To address the problem, we derive an approximate solution to circumvent the construction of Hessian matrix, thereby making FAA practical and cheap. Extensive experiments show the transferability of adversarial examples crafted by FAA can be considerably boosted compared with state-of-the-art baselines.
    ChaTA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs. (arXiv:2311.02775v2 [cs.LG] UPDATED)
    Responding to the thousands of student questions on online QA platforms each semester has a considerable human cost, particularly in computing courses with rapidly growing enrollments. To address the challenges of scalable and intelligent question-answering (QA), we introduce an innovative solution that leverages open-source Large Language Models (LLMs) from the LLaMA-2 family to ensure data privacy. Our approach combines augmentation techniques such as retrieval augmented generation (RAG), supervised fine-tuning (SFT), and learning from human preferences data using Direct Preference Optimization (DPO). Through extensive experimentation on a Piazza dataset from an introductory CS course, comprising 10,000 QA pairs and 1,500 pairs of preference data, we demonstrate a significant 30% improvement in the quality of answers, with RAG being a particularly impactful addition. Our contributions include the development of a novel architecture for educational QA, extensive evaluations of LLM performance utilizing both human assessments and LLM-based metrics, and insights into the challenges and future directions of educational data processing. This work paves the way for the development of CHATA, an intelligent QA assistant customizable for courses with an online QA platform
    Deep Ridgelet Transform: Voice with Koopman Operator Proves Universality of Formal Deep Networks. (arXiv:2310.03529v2 [cs.LG] UPDATED)
    We identify hidden layers inside a deep neural network (DNN) with group actions on the data domain, and formulate a formal deep network as a dual voice transform with respect to the Koopman operator, a linear representation of the group action. Based on the group theoretic arguments, particularly by using Schur's lemma, we show a simple proof of the universality of DNNs.
    PREM: A Simple Yet Effective Approach for Node-Level Graph Anomaly Detection. (arXiv:2310.11676v2 [cs.LG] UPDATED)
    Node-level graph anomaly detection (GAD) plays a critical role in identifying anomalous nodes from graph-structured data in various domains such as medicine, social networks, and e-commerce. However, challenges have arisen due to the diversity of anomalies and the dearth of labeled data. Existing methodologies - reconstruction-based and contrastive learning - while effective, often suffer from efficiency issues, stemming from their complex objectives and elaborate modules. To improve the efficiency of GAD, we introduce a simple method termed PREprocessing and Matching (PREM for short). Our approach streamlines GAD, reducing time and memory consumption while maintaining powerful anomaly detection capabilities. Comprising two modules - a pre-processing module and an ego-neighbor matching module - PREM eliminates the necessity for message-passing propagation during training, and employs a simple contrastive loss, leading to considerable reductions in training time and memory usage. Moreover, through rigorous evaluations of five real-world datasets, our method demonstrated robustness and effectiveness. Notably, when validated on the ACM dataset, PREM achieved a 5% improvement in AUC, a 9-fold increase in training speed, and sharply reduce memory usage compared to the most efficient baseline.
    TATA: Stance Detection via Topic-Agnostic and Topic-Aware Embeddings. (arXiv:2310.14450v2 [cs.CL] UPDATED)
    Stance detection is important for understanding different attitudes and beliefs on the Internet. However, given that a passage's stance toward a given topic is often highly dependent on that topic, building a stance detection model that generalizes to unseen topics is difficult. In this work, we propose using contrastive learning as well as an unlabeled dataset of news articles that cover a variety of different topics to train topic-agnostic/TAG and topic-aware/TAW embeddings for use in downstream stance detection. Combining these embeddings in our full TATA model, we achieve state-of-the-art performance across several public stance detection datasets (0.771 $F_1$-score on the Zero-shot VAST dataset). We release our code and data at https://github.com/hanshanley/tata.
    Integrating Pre-trained Language Model into Neural Machine Translation. (arXiv:2310.19680v2 [cs.CL] UPDATED)
    Neural Machine Translation (NMT) has become a significant technology in natural language processing through extensive research and development. However, the deficiency of high-quality bilingual language pair data still poses a major challenge to improving NMT performance. Recent studies are exploring the use of contextual information from pre-trained language model (PLM) to address this problem. Yet, the issue of incompatibility between PLM and NMT model remains unresolved. This study proposes a PLM-integrated NMT (PiNMT) model to overcome the identified problems. The PiNMT model consists of three critical components, PLM Multi Layer Converter, Embedding Fusion, and Cosine Alignment, each playing a vital role in providing effective PLM information to NMT. Furthermore, two training strategies, Separate Learning Rates and Dual Step Training, are also introduced in this paper. By implementing the proposed PiNMT model and training strategy, we achieved state-of-the-art performance on the IWSLT'14 En$\leftrightarrow$De dataset. This study's outcomes are noteworthy as they demonstrate a novel approach for efficiently integrating PLM with NMT to overcome incompatibility and enhance performance.
    Towards the Fundamental Limits of Knowledge Transfer over Finite Domains. (arXiv:2310.07838v3 [cs.LG] UPDATED)
    We characterize the statistical efficiency of knowledge transfer through $n$ samples from a teacher to a probabilistic student classifier with input space $\mathcal S$ over labels $\mathcal A$. We show that privileged information at three progressive levels accelerates the transfer. At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate $\sqrt{{|{\mathcal S}||{\mathcal A}|}/{n}}$. The second level has the teacher probabilities of sampled labels available in addition, which turns out to boost the convergence rate lower bound to ${{|{\mathcal S}||{\mathcal A}|}/{n}}$. However, under this second data acquisition protocol, minimizing a naive adaptation of the cross-entropy loss results in an asymptotically biased student. We overcome this limitation and achieve the fundamental limit by using a novel empirical variant of the squared error logit loss. The third level further equips the student with the soft labels (complete logits) on ${\mathcal A}$ given every sampled input, thereby provably enables the student to enjoy a rate ${|{\mathcal S}|}/{n}$ free of $|{\mathcal A}|$. We find any Kullback-Leibler divergence minimizer to be optimal in the last case. Numerical simulations distinguish the four learners and corroborate our theory.
    Graph Meets LLMs: Towards Large Graph Models. (arXiv:2308.14522v2 [cs.LG] UPDATED)
    Large models have emerged as the most recent groundbreaking achievements in artificial intelligence, and particularly machine learning. However, when it comes to graphs, large models have not achieved the same level of success as in other fields, such as natural language processing and computer vision. In order to promote applying large models for graphs forward, we present a perspective paper to discuss the challenges and opportunities associated with developing large graph models. First, we discuss the desired characteristics of large graph models. Then, we present detailed discussions from three key perspectives: representation basis, graph data, and graph models. In each category, we provide a brief overview of recent advances and highlight the remaining challenges together with our visions. Finally, we discuss valuable applications of large graph models. We believe this perspective can encourage further investigations into large graph models, ultimately pushing us one step closer towards artificial general intelligence (AGI). We are the first to comprehensively study large graph models, to the best of our knowledge.
    Activation Addition: Steering Language Models Without Optimization. (arXiv:2308.10248v3 [cs.CL] UPDATED)
    Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering and guided decoding. We instead investigate activation engineering: modifying activations at inference-time to predictably alter model behavior. We bias the forward pass with a 'steering vector' implicitly specified through natural language. Past work learned these steering vectors; our Activation Addition (ActAdd) method instead computes them by taking the activation differences which result from pairs of prompts. We demonstrate ActAdd on GPT-2 on OpenWebText and ConceptNet, and replicate the effect on Llama-13B and GPT-J-6B. Our approach yields inference-time control over high-level properties of output & preserves performance on off-target topics. The method requires far less compute and implementation effort than finetuning and RLHF, allows for natural language specification by users, and its overhead scales naturally with model size.
    Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text. (arXiv:2311.06440v1 [cs.CL])
    Data quality is a problem that perpetually resurfaces throughout the field of NLP, regardless of task, domain, or architecture, and remains especially severe for lower-resource languages. A typical and insidious issue, affecting both training data and model output, is data that is repetitive and dominated by linguistically uninteresting boilerplate, such as price catalogs or computer-generated log files. Though this problem permeates many web-scraped corpora, there has yet to be a benchmark to test against, or a systematic study to find simple metrics that generalize across languages and agree with human judgements of data quality. In the present work, we create and release BREAD, a human-labeled benchmark on repetitive boilerplate vs. plausible linguistic content, spanning 360 languages. We release several baseline CRED (Character REDundancy) scores along with it, and evaluate their effectiveness on BREAD. We hope that the community will use this resource to develop better filtering methods, and that our reference implementations of CRED scores can become standard corpus evaluation tools, driving the development of cleaner language modeling corpora, especially in low-resource languages.
    Federated Learning for Connected and Automated Vehicles: A Survey of Existing Approaches and Challenges. (arXiv:2308.10407v2 [cs.LG] UPDATED)
    Machine learning (ML) is widely used for key tasks in Connected and Automated Vehicles (CAV), including perception, planning, and control. However, its reliance on vehicular data for model training presents significant challenges related to in-vehicle user privacy and communication overhead generated by massive data volumes. Federated learning (FL) is a decentralized ML approach that enables multiple vehicles to collaboratively develop models, broadening learning from various driving environments, enhancing overall performance, and simultaneously securing local vehicle data privacy and security. This survey paper presents a review of the advancements made in the application of FL for CAV (FL4CAV). First, centralized and decentralized frameworks of FL are analyzed, highlighting their key characteristics and methodologies. Second, diverse data sources, models, and data security techniques relevant to FL in CAVs are reviewed, emphasizing their significance in ensuring privacy and confidentiality. Third, specific applications of FL are explored, providing insight into the base models and datasets employed for each application. Finally, existing challenges for FL4CAV are listed and potential directions for future investigation to further enhance the effectiveness and efficiency of FL in the context of CAV are discussed.
    Learning Deductive Reasoning from Synthetic Corpus based on Formal Logic. (arXiv:2308.07336v2 [cs.AI] UPDATED)
    We study a synthetic corpus based approach for language models (LMs) to acquire logical deductive reasoning ability. The previous studies generated deduction examples using specific sets of deduction rules. However, these rules were limited or otherwise arbitrary, limiting the generalizability of acquired reasoning ability. We rethink this and adopt a well-grounded set of deduction rules based on formal logic theory, which can derive any other deduction rules when combined in a multistep way. Then, using the proposed corpora, which we name FLD (Formal Logic Deduction), we first evaluate and analyze the logical reasoning ability of the latest LLMs. Even GPT-4 can solve only half of the problems, suggesting that pure logical reasoning isolated from knowledge is still challenging for the LLMs, and additional training specialized in logical reasoning is indeed essential. We next empirically verify that LMs trained on FLD corpora acquire more generalizable reasoning ability. Furthermore, we identify the aspects of reasoning ability on which deduction corpora can enhance LMs and those on which they cannot, and discuss future directions on each aspect. The released corpora serve both as learning resources and as challenging benchmarks.
    Generating Personalized Insulin Treatments Strategies with Deep Conditional Generative Time Series Models. (arXiv:2309.16521v2 [stat.ML] UPDATED)
    We propose a novel framework that combines deep generative time series models with decision theory for generating personalized treatment strategies. It leverages historical patient trajectory data to jointly learn the generation of realistic personalized treatment and future outcome trajectories through deep generative time series models. In particular, our framework enables the generation of novel multivariate treatment strategies tailored to the personalized patient history and trained for optimal expected future outcomes based on conditional expected utility maximization. We demonstrate our framework by generating personalized insulin treatment strategies and blood glucose predictions for hospitalized diabetes patients, showcasing the potential of our approach for generating improved personalized treatment strategies. Keywords: deep generative model, probabilistic decision support, personalized treatment generation, insulin and blood glucose prediction
    No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models. (arXiv:2307.06440v3 [cs.LG] UPDATED)
    The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream performance faster than standard training. In this work, we revisit three categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO loss), and efficient optimizers (Lion, Sophia). When pre-training BERT and T5 with a fixed computation budget using such methods, we find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate. We define an evaluation protocol that enables computation to be done on arbitrary machines by mapping all computation time to a reference machine which we call reference system time. We discuss the limitations of our proposed protocol and release our code to encourage rigorous research in efficient training procedures: https://github.com/JeanKaddour/NoTrainNoGain.
    Steering Prototypes with Prompt-tuning for Rehearsal-free Continual Learning. (arXiv:2303.09447v3 [cs.LG] UPDATED)
    In the context of continual learning, prototypes-as representative class embeddings-offer advantages in memory conservation and the mitigation of catastrophic forgetting. However, challenges related to semantic drift and prototype interference persist. In this study, we introduce the Contrastive Prototypical Prompt (CPP) approach. Through task-specific prompt-tuning, underpinned by a contrastive learning objective, we effectively address both aforementioned challenges. Our evaluations on four challenging class-incremental benchmarks reveal that CPP achieves a significant 4% to 6% improvement over state-of-the-art methods. Importantly, CPP operates without a rehearsal buffer and narrows the performance divergence between continual and offline joint-learning, suggesting an innovative scheme for Transformer-based continual learning systems.
    MixPro: Simple yet Effective Data Augmentation for Prompt-based Learning. (arXiv:2304.09402v2 [cs.CL] UPDATED)
    Prompt-based learning has shown considerable promise in reformulating various downstream tasks as cloze problems by combining original input with a predetermined template. This approach demonstrates its effectiveness, especially in few-shot learning scenarios, where the model is trained on a scarce amount of data. Despite its successes, the limited templates and text in few-shot prompt-based learning scenarios leave significant room for performance improvement. Moreover, existing methods sometimes resort to model ensembles, which, while effective, could potentially hamper model efficiency due to increased computational demands. To address these issues, we introduce MixPro, an augmentation method designed to augment both the vanilla input text and the templates. We implement this through the token-level, the sentence-level, and the template-level Mixup strategies. The experimental results on five few-shot datasets show that MixPro outperforms other augmentation baselines, improving model performance by an average of 5.08% compared to before augmentation.
    GBG++: A Fast and Stable Granular Ball Generation Method for Classification. (arXiv:2305.18450v2 [cs.LG] UPDATED)
    Granular ball computing (GBC), as an efficient, robust, and scalable learning method, has become a popular research topic of granular computing. GBC includes two stages: granular ball generation (GBG) and multi-granularity learning based on the granular ball (GB). However, the stability and efficiency of existing GBG methods need to be further improved due to their strong dependence on $k$-means or $k$-division. In addition, GB-based classifiers only unilaterally consider the GB's geometric characteristics to construct classification rules, but the GB's quality is ignored. Therefore, in this paper, based on the attention mechanism, a fast and stable GBG (GBG++) method is proposed first. Specifically, the proposed GBG++ method only needs to calculate the distances from the data-driven center to the undivided samples when splitting each GB instead of randomly selecting the center and calculating the distances between it and all samples. Moreover, an outlier detection method is introduced to identify local outliers. Consequently, the GBG++ method can significantly improve effectiveness, robustness, and efficiency while being absolutely stable. Second, considering the influence of the sample size within the GB on the GB's quality, based on the GBG++ method, an improved GB-based $k$-nearest neighbors algorithm (GB$k$NN++) is presented, which can reduce misclassification at the class boundary. Finally, the experimental results indicate that the proposed method outperforms several existing GB-based classifiers and classical machine learning classifiers on $24$ public benchmark datasets.
    DeepGD: A Multi-Objective Black-Box Test Selection Approach for Deep Neural Networks. (arXiv:2303.04878v3 [cs.LG] UPDATED)
    Deep neural networks (DNNs) are widely used in various application domains such as image processing, speech recognition, and natural language processing. However, testing DNN models may be challenging due to the complexity and size of their input domain. Particularly, testing DNN models often requires generating or exploring large unlabeled datasets. In practice, DNN test oracles, which identify the correct outputs for inputs, often require expensive manual effort to label test data, possibly involving multiple experts to ensure labeling correctness. In this paper, we propose DeepGD, a black-box multi-objective test selection approach for DNN models. It reduces the cost of labeling by prioritizing the selection of test inputs with high fault revealing power from large unlabeled datasets. DeepGD not only selects test inputs with high uncertainty scores to trigger as many mispredicted inputs as possible but also maximizes the probability of revealing distinct faults in the DNN model by selecting diverse mispredicted inputs. The experimental results conducted on four widely used datasets and five DNN models show that in terms of fault-revealing ability: (1) White-box, coverage-based approaches fare poorly, (2) DeepGD outperforms existing black-box test selection approaches in terms of fault detection, and (3) DeepGD also leads to better guidance for DNN model retraining when using selected inputs to augment the training set.
    Forecasting Workload in Cloud Computing: Towards Uncertainty-Aware Predictions and Transfer Learning. (arXiv:2303.13525v2 [cs.DC] UPDATED)
    Predicting future resource demand in Cloud Computing is essential for optimizing the trade-off between serving customers' requests efficiently and minimizing the provisioning cost. Modelling prediction uncertainty is also desirable to better inform the resource decision-making process, but research in this field is under-investigated. In this paper, we propose univariate and bivariate Bayesian deep learning models that provide predictions of future workload demand and its uncertainty. We run extensive experiments on Google and Alibaba clusters, where we first train our models with datasets from different cloud providers and compare them with LSTM-based baselines. Results show that modelling the uncertainty of predictions has a positive impact on performance, especially on service level metrics, because uncertainty quantification can be tailored to desired target service levels that are critical in cloud applications. Moreover, we investigate whether our models benefit transfer learning capabilities across different domains, i.e. dataset distributions. Experiments on the same workload datasets reveal that acceptable transfer learning performance can be achieved within the same provider (because distributions are more similar). Also, domain knowledge does not transfer when the source and target domains are very different (e.g. from different providers), but this performance degradation can be mitigated by increasing the training set size of the source domain.
    Verifiable Learning for Robust Tree Ensembles. (arXiv:2305.03626v4 [cs.LG] UPDATED)
    Verifying the robustness of machine learning models against evasion attacks at test time is an important research problem. Unfortunately, prior work established that this problem is NP-hard for decision tree ensembles, hence bound to be intractable for specific inputs. In this paper, we identify a restricted class of decision tree ensembles, called large-spread ensembles, which admit a security verification algorithm running in polynomial time. We then propose a new approach called verifiable learning, which advocates the training of such restricted model classes which are amenable for efficient verification. We show the benefits of this idea by designing a new training algorithm that automatically learns a large-spread decision tree ensemble from labelled data, thus enabling its security verification in polynomial time. Experimental results on public datasets confirm that large-spread ensembles trained using our algorithm can be verified in a matter of seconds, using standard commercial hardware. Moreover, large-spread ensembles are more robust than traditional ensembles against evasion attacks, at the cost of an acceptable loss of accuracy in the non-adversarial setting.
    Single-Call Stochastic Extragradient Methods for Structured Non-monotone Variational Inequalities: Improved Analysis under Weaker Conditions. (arXiv:2302.14043v2 [math.OC] UPDATED)
    Single-call stochastic extragradient methods, like stochastic past extragradient (SPEG) and stochastic optimistic gradient (SOG), have gained a lot of interest in recent years and are one of the most efficient algorithms for solving large-scale min-max optimization and variational inequalities problems (VIP) appearing in various machine learning tasks. However, despite their undoubted popularity, current convergence analyses of SPEG and SOG require a bounded variance assumption. In addition, several important questions regarding the convergence properties of these methods are still open, including mini-batching, efficient step-size selection, and convergence guarantees under different sampling strategies. In this work, we address these questions and provide convergence guarantees for two large classes of structured non-monotone VIPs: (i) quasi-strongly monotone problems (a generalization of strongly monotone problems) and (ii) weak Minty variational inequalities (a generalization of monotone and Minty VIPs). We introduce the expected residual condition, explain its benefits, and show how it can be used to obtain a strictly weaker bound than previously used growth conditions, expected co-coercivity, or bounded variance assumptions. Equipped with this condition, we provide theoretical guarantees for the convergence of single-call extragradient methods for different step-size selections, including constant, decreasing, and step-size-switching rules. Furthermore, our convergence analysis holds under the arbitrary sampling paradigm, which includes importance sampling and various mini-batching strategies as special cases.
    MetisFL: An Embarrassingly Parallelized Controller for Scalable & Efficient Federated Learning Workflows. (arXiv:2311.00334v2 [cs.LG] UPDATED)
    A Federated Learning (FL) system typically consists of two core processing entities: the federation controller and the learners. The controller is responsible for managing the execution of FL workflows across learners and the learners for training and evaluating federated models over their private datasets. While executing an FL workflow, the FL system has no control over the computational resources or data of the participating learners. Still, it is responsible for other operations, such as model aggregation, task dispatching, and scheduling. These computationally heavy operations generally need to be handled by the federation controller. Even though many FL systems have been recently proposed to facilitate the development of FL workflows, most of these systems overlook the scalability of the controller. To meet this need, we designed and developed a novel FL system called MetisFL, where the federation controller is the first-class citizen. MetisFL re-engineers all the operations conducted by the federation controller to accelerate the training of large-scale FL workflows. By quantitatively comparing MetisFL against other state-of-the-art FL systems, we empirically demonstrate that MetisFL leads to a 10-fold wall-clock time execution boost across a wide range of challenging FL workflows with increasing model sizes and federation sites.
    Optimal Goal-Reaching Reinforcement Learning via Quasimetric Learning. (arXiv:2304.01203v6 [cs.LG] UPDATED)
    In goal-reaching reinforcement learning (RL), the optimal value function has a particular geometry, called quasimetric structure. This paper introduces Quasimetric Reinforcement Learning (QRL), a new RL method that utilizes quasimetric models to learn optimal value functions. Distinct from prior approaches, the QRL objective is specifically designed for quasimetrics, and provides strong theoretical recovery guarantees. Empirically, we conduct thorough analyses on a discretized MountainCar environment, identifying properties of QRL and its advantages over alternatives. On offline and online goal-reaching benchmarks, QRL also demonstrates improved sample efficiency and performance, across both state-based and image-based observations.
    FACT: High-Dimensional Random Forests Inference. (arXiv:2207.01678v2 [stat.ML] UPDATED)
    Quantifying the usefulness of individual features in random forests learning can greatly enhance its interpretability. Existing studies have shown that some popularly used feature importance measures for random forests suffer from the bias issue. In addition, there lack comprehensive size and power analyses for most of these existing methods. In this paper, we approach the problem via hypothesis testing, and suggest a framework of the self-normalized feature-residual correlation test (FACT) for evaluating the significance of a given feature in the random forests model with bias-resistance property, where our null hypothesis concerns whether the feature is conditionally independent of the response given all other features. Such an endeavor on random forests inference is empowered by some recent developments on high-dimensional random forests consistency. Under a fairly general high-dimensional nonparametric model setting with dependent features, we formally establish that FACT can provide theoretically justified feature importance test with controlled type I error and enjoy appealing power property. The theoretical results and finite-sample advantages of the newly suggested method are illustrated with several simulation examples and an economic forecasting application.
    Towards Last-layer Retraining for Group Robustness with Fewer Annotations. (arXiv:2309.08534v2 [cs.LG] UPDATED)
    Empirical risk minimization (ERM) of neural networks is prone to over-reliance on spurious correlations and poor generalization on minority groups. The recent deep feature reweighting (DFR) technique achieves state-of-the-art group robustness via simple last-layer retraining, but it requires held-out group and class annotations to construct a group-balanced reweighting dataset. In this work, we examine this impractical requirement and find that last-layer retraining can be surprisingly effective with no group annotations (other than for model selection) and only a handful of class annotations. We first show that last-layer retraining can greatly improve worst-group accuracy even when the reweighting dataset has only a small proportion of worst-group data. This implies a "free lunch" where holding out a subset of training data to retrain the last layer can substantially outperform ERM on the entire dataset with no additional data or annotations. To further improve group robustness, we introduce a lightweight method called selective last-layer finetuning (SELF), which constructs the reweighting dataset using misclassifications or disagreements. Our empirical and theoretical results present the first evidence that model disagreement upsamples worst-group data, enabling SELF to nearly match DFR on four well-established benchmarks across vision and language tasks with no group annotations and less than 3% of the held-out class annotations. Our code is available at https://github.com/tmlabonte/last-layer-retraining.
    RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model. (arXiv:2308.05345v3 [cs.LG] UPDATED)
    Inspired by the recent success of large language models (LLMs) like ChatGPT, researchers start to explore the adoption of LLMs for agile hardware design, such as generating design RTL based on natural-language instructions. However, in existing works, their target designs are all relatively simple and in a small scale, and proposed by the authors themselves, making a fair comparison among different LLM solutions challenging. In addition, many prior works only focus on the design correctness, without evaluating the design qualities of generated design RTL. In this work, we propose an open-source benchmark named RTLLM, for generating design RTL with natural language instructions. To systematically evaluate the auto-generated design RTL, we summarized three progressive goals, named syntax goal, functionality goal, and design quality goal. This benchmark can automatically provide a quantitative evaluation of any given LLM-based solution. Furthermore, we propose an easy-to-use yet surprisingly effective prompt engineering technique named self-planning, which proves to significantly boost the performance of GPT-3.5 in our proposed benchmark.
    Hacking Generative Models with Differentiable Network Bending. (arXiv:2310.04816v2 [cs.CV] UPDATED)
    In this work, we propose a method to 'hack' generative models, pushing their outputs away from the original training distribution towards a new objective. We inject a small-scale trainable module between the intermediate layers of the model and train it for a low number of iterations, keeping the rest of the network frozen. The resulting output images display an uncanny quality, given by the tension between the original and new objectives that can be exploited for artistic purposes.
    A simple learning agent interacting with an agent-based market model. (arXiv:2208.10434v4 [q-fin.TR] UPDATED)
    We consider the learning dynamics of a single reinforcement learning optimal execution trading agent when it interacts with an event driven agent-based financial market model. Trading takes place asynchronously through a matching engine in event time. The optimal execution agent is considered at different levels of initial order-sizes and differently sized state spaces. The resulting impact on the agent-based model and market are considered using a calibration approach that explores changes in the empirical stylised facts and price impact curves. Convergence, volume trajectory and action trace plots are used to visualise the learning dynamics. Here the smaller state space agents had the number of states they visited converge much faster than the larger state space agents, and they were able to start learning to trade intuitively using the spread and volume states. We find that the moments of the model are robust to the impact of the learning agents except for the Hurst exponent, which was lowered by the introduction of strategic order-splitting. The introduction of the learning agent preserves the shape of the price impact curves but can reduce the trade-sign auto-correlations when their trading volumes increase.
    Mixed Semi-Supervised Generalized-Linear-Regression with applications to Deep-Learning and Interpolators. (arXiv:2302.09526v2 [stat.ME] UPDATED)
    We present a methodology for using unlabeled data to design semi supervised learning (SSL) methods that improve the prediction performance of supervised learning for regression tasks. The main idea is to design different mechanisms for integrating the unlabeled data, and include in each of them a mixing parameter $\alpha$, controlling the weight given to the unlabeled data. Focusing on Generalized Linear Models (GLM) and linear interpolators classes of models, we analyze the characteristics of different mixing mechanisms, and prove that in all cases, it is invariably beneficial to integrate the unlabeled data with some nonzero mixing ratio $\alpha>0$, in terms of predictive performance. Moreover, we provide a rigorous framework to estimate the best mixing ratio $\alpha^*$ where mixed SSL delivers the best predictive performance, while using the labeled and unlabeled data on hand. The effectiveness of our methodology in delivering substantial improvement compared to the standard supervised models, in a variety of settings, is demonstrated empirically through extensive simulation, in a manner that supports the theoretical analysis. We also demonstrate the applicability of our methodology (with some intuitive modifications) to improve more complex models, such as deep neural networks, in real-world regression tasks.
    Generalized Simultaneous Perturbation-based Gradient Search with Reduced Estimator Bias. (arXiv:2212.10477v2 [cs.LG] UPDATED)
    We present in this paper a family of generalized simultaneous perturbation-based gradient search (GSPGS) estimators that use noisy function measurements. The number of function measurements required by each estimator is guided by the desired level of accuracy. We first present in detail unbalanced generalized simultaneous perturbation stochastic approximation (GSPSA) estimators and later present the balanced versions (B-GSPSA) of these. We extend this idea further and present the generalized smoothed functional (GSF) and generalized random directions stochastic approximation (GRDSA) estimators, respectively, as well as their balanced variants. We show that estimators within any specified class requiring more number of function measurements result in lower estimator bias. We present a detailed analysis of both the asymptotic and non-asymptotic convergence of the resulting stochastic approximation schemes. We further present a series of experimental results with the various GSPGS estimators on the Rastrigin and quadratic function objectives. Our experiments are seen to validate our theoretical findings.
    Machine Learning for Practical Quantum Error Mitigation. (arXiv:2309.17368v1 [quant-ph] CROSS LISTED)
    Quantum computers are actively competing to surpass classical supercomputers, but quantum errors remain their chief obstacle. The key to overcoming these on near-term devices has emerged through the field of quantum error mitigation, enabling improved accuracy at the cost of additional runtime. In practice, however, the success of mitigation is limited by a generally exponential overhead. Can classical machine learning address this challenge on today's quantum computers? Here, through both simulations and experiments on state-of-the-art quantum computers using up to 100 qubits, we demonstrate that machine learning for quantum error mitigation (ML-QEM) can drastically reduce overheads, maintain or even surpass the accuracy of conventional methods, and yield near noise-free results for quantum algorithms. We benchmark a variety of machine learning models -- linear regression, random forests, multi-layer perceptrons, and graph neural networks -- on diverse classes of quantum circuits, over increasingly complex device-noise profiles, under interpolation and extrapolation, and for small and large quantum circuits. These tests employ the popular digital zero-noise extrapolation method as an added reference. We further show how to scale ML-QEM to classically intractable quantum circuits by mimicking the results of traditional mitigation results, while significantly reducing overhead. Our results highlight the potential of classical machine learning for practical quantum computation.
    PATROL: Privacy-Oriented Pruning for Collaborative Inference Against Model Inversion Attacks. (arXiv:2307.10981v2 [cs.LG] UPDATED)
    Collaborative inference has been a promising solution to enable resource-constrained edge devices to perform inference using state-of-the-art deep neural networks (DNNs). In collaborative inference, the edge device first feeds the input to a partial DNN locally and then uploads the intermediate result to the cloud to complete the inference. However, recent research indicates model inversion attacks (MIAs) can reconstruct input data from intermediate results, posing serious privacy concerns for collaborative inference. Existing perturbation and cryptography techniques are inefficient and unreliable in defending against MIAs while performing accurate inference. This paper provides a viable solution, named PATROL, which develops privacy-oriented pruning to balance privacy, efficiency, and utility of collaborative inference. PATROL takes advantage of the fact that later layers in a DNN can extract more task-specific features. Given limited local resources for collaborative inference, PATROL intends to deploy more layers at the edge based on pruning techniques to enforce task-specific features for inference and reduce task-irrelevant but sensitive features for privacy preservation. To achieve privacy-oriented pruning, PATROL introduces two key components: Lipschitz regularization and adversarial reconstruction training, which increase the reconstruction errors by reducing the stability of MIAs and enhance the target inference model by adversarial training, respectively. On a real-world collaborative inference task, vehicle re-identification, we demonstrate the superior performance of PATROL in terms of against MIAs.
    Joint Group Invariant Functions on Data-Parameter Domain Induce Universal Neural Networks. (arXiv:2310.03530v2 [cs.LG] UPDATED)
    The symmetry and geometry of input data are considered to be encoded in the internal data representation inside the neural network, but the specific encoding rule has been less investigated. In this study, we present a systematic method to induce a generalized neural network and its right inverse operator, called the ridgelet transform, from a joint group invariant function on the data-parameter domain. Since the ridgelet transform is an inverse, (1) it can describe the arrangement of parameters for the network to represent a target function, which is understood as the encoding rule, and (2) it implies the universality of the network. Based on the group representation theory, we present a new simple proof of the universality by using Schur's lemma in a unified manner covering a wide class of networks, for example, the original ridgelet transform, formal deep networks, and the dual voice transform. Since traditional universality theorems were demonstrated based on functional analysis, this study sheds light on the group theoretic aspect of the approximation theory, connecting geometric deep learning to abstract harmonic analysis.
    Stacked networks improve physics-informed training: applications to neural networks and deep operator networks. (arXiv:2311.06483v1 [cs.LG])
    Physics-informed neural networks and operator networks have shown promise for effectively solving equations modeling physical systems. However, these networks can be difficult or impossible to train accurately for some systems of equations. We present a novel multifidelity framework for stacking physics-informed neural networks and operator networks that facilitates training. We successively build a chain of networks, where the output at one step can act as a low-fidelity input for training the next step, gradually increasing the expressivity of the learned model. The equations imposed at each step of the iterative process can be the same or different (akin to simulated annealing). The iterative (stacking) nature of the proposed method allows us to progressively learn features of a solution that are hard to learn directly. Through benchmark problems including a nonlinear pendulum, the wave equation, and the viscous Burgers equation, we show how stacking can be used to improve the accuracy and reduce the required size of physics-informed neural networks and operator networks.
    Tackling the Curse of Dimensionality with Physics-Informed Neural Networks. (arXiv:2307.12306v4 [cs.LG] UPDATED)
    The curse-of-dimensionality taxes computational resources heavily with exponentially increasing computational cost as the dimension increases. This poses great challenges in solving high-dimensional PDEs, as Richard E. Bellman first pointed out over 60 years ago. While there has been some recent success in solving numerically partial differential equations (PDEs) in high dimensions, such computations are prohibitively expensive, and true scaling of general nonlinear PDEs to high dimensions has never been achieved. We develop a new method of scaling up physics-informed neural networks (PINNs) to solve arbitrary high-dimensional PDEs. The new method, called Stochastic Dimension Gradient Descent (SDGD), decomposes a gradient of PDEs into pieces corresponding to different dimensions and randomly samples a subset of these dimensional pieces in each iteration of training PINNs. We prove theoretically the convergence and other desired properties of the proposed method. We demonstrate in various diverse tests that the proposed method can solve many notoriously hard high-dimensional PDEs, including the Hamilton-Jacobi-Bellman (HJB) and the Schr\"{o}dinger equations in tens of thousands of dimensions very fast on a single GPU using the PINNs mesh-free approach. Notably, we solve nonlinear PDEs with nontrivial, anisotropic, and inseparable solutions in 100,000 effective dimensions in 12 hours on a single GPU using SDGD with PINNs. Since SDGD is a general training methodology of PINNs, it can be applied to any current and future variants of PINNs to scale them up for arbitrary high-dimensional PDEs.
    Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio. (arXiv:2311.06285v1 [cs.CV])
    While 3D human body modeling has received much attention in computer vision, modeling the acoustic equivalent, i.e. modeling 3D spatial audio produced by body motion and speech, has fallen short in the community. To close this gap, we present a model that can generate accurate 3D spatial audio for full human bodies. The system consumes, as input, audio signals from headset microphones and body pose, and produces, as output, a 3D sound field surrounding the transmitter's body, from which spatial audio can be rendered at any arbitrary position in the 3D space. We collect a first-of-its-kind multimodal dataset of human bodies, recorded with multiple cameras and a spherical array of 345 microphones. In an empirical evaluation, we demonstrate that our model can produce accurate body-induced sound fields when trained with a suitable loss. Dataset and code are available online.
    Machine Learning Classification Techniques for Identifying the Defective Patterns in Semiconductor Wafer Maps: A Survey, Empirical, and Experimental Evaluations. (arXiv:2310.10705v2 [cs.LG] UPDATED)
    This survey paper offers a comprehensive review of methodologies utilizing machine learning (ML) classification techniques for identifying wafer defects in semiconductor manufacturing. Despite the growing body of research demonstrating the effectiveness of ML in wafer defect identification, there is a noticeable absence of comprehensive reviews on this subject. This survey attempts to fill this void by amalgamating available literature and providing an in-depth analysis of the advantages, limitations, and potential applications of various ML classification algorithms in the realm of wafer defect detection. An innovative taxonomy of methodologies that we present provides a detailed classification of algorithms into more refined categories and techniques. This taxonomy follows a four-tier structure, starting from broad methodology categories and ending with specific sub-techniques. It aids researchers in comprehending the complex relationships between different algorithms and their techniques. We employ a rigorous empirical and experimental evaluation to rank these varying techniques. For the empirical evaluation, we assess techniques based on a set of four criteria. The experimental evaluation ranks the algorithms employing the same sub-techniques, techniques, sub-categories, and categories. This integration of a multi-layered taxonomy, empirical evaluations, and comparative experiments provides a detailed and holistic understanding of ML techniques and algorithms for identifying wafer defects. Additionally, the paper illuminates the future prospects of ML classification techniques for wafer defect identification, underscoring potential advancements and opportunities for further research in this field
    RLTF: Reinforcement Learning from Unit Test Feedback. (arXiv:2307.04349v2 [cs.AI] UPDATED)
    The goal of program synthesis, or code generation, is to generate executable code based on given descriptions. Recently, there has been an increasing number of studies employing reinforcement learning (RL) to improve the performance of large language models (LLMs) for code. However, current representative works either rely solely on offline frameworks, limiting the exploration of new sample spaces, or fall short in the utilization of unit test signals, not accounting for specific error locations within the code. To address these issues, we propose RLTF, i.e., Reinforcement Learning from Unit Test Feedback, a novel online RL framework with unit test feedback of multi-granularity for refining code LLMs. Our approach generates data in real-time during training and simultaneously utilizes fine-grained feedback signals to guide the model towards producing higher-quality code. Extensive experiments show that RLTF achieves state-of-the-art performance on the APPS and the MBPP benchmarks. Our code is available at: https://github.com/Zyq-scut/RLTF.
    Anytime Model Selection in Linear Bandits. (arXiv:2307.12897v2 [stat.ML] UPDATED)
    Model selection in the context of bandit optimization is a challenging problem, as it requires balancing exploration and exploitation not only for action selection, but also for model selection. One natural approach is to rely on online learning algorithms that treat different models as experts. Existing methods, however, scale poorly ($\text{poly}M$) with the number of models $M$ in terms of their regret. Our key insight is that, for model selection in linear bandits, we can emulate full-information feedback to the online learner with a favorable bias-variance trade-off. This allows us to develop ALEXP, which has an exponentially improved ($\log M$) dependence on $M$ for its regret. ALEXP has anytime guarantees on its regret, and neither requires knowledge of the horizon $n$, nor relies on an initial purely exploratory stage. Our approach utilizes a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.
    A Critical Re-evaluation of Benchmark Datasets for (Deep) Learning-Based Matching Algorithms. (arXiv:2307.01231v2 [cs.DB] UPDATED)
    Entity resolution (ER) is the process of identifying records that refer to the same entities within one or across multiple databases. Numerous techniques have been developed to tackle ER challenges over the years, with recent emphasis placed on machine and deep learning methods for the matching phase. However, the quality of the benchmark datasets typically used in the experimental evaluations of learning-based matching algorithms has not been examined in the literature. To cover this gap, we propose four different approaches to assessing the difficulty and appropriateness of 13 established datasets: two theoretical approaches, which involve new measures of linearity and existing measures of complexity, and two practical approaches: the difference between the best non-linear and linear matchers, as well as the difference between the best learning-based matcher and the perfect oracle. Our analysis demonstrates that most of the popular datasets pose rather easy classification tasks. As a result, they are not suitable for properly evaluating learning-based matching algorithms. To address this issue, we propose a new methodology for yielding benchmark datasets. We put it into practice by creating four new matching tasks, and we verify that these new benchmarks are more challenging and therefore more suitable for further advancements in the field.
    SciRepEval: A Multi-Format Benchmark for Scientific Document Representations. (arXiv:2211.13308v4 [cs.CL] UPDATED)
    Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 24 challenging and realistic tasks, 8 of which are new, across four formats: classification, regression, ranking and search. We then use this benchmark to study and improve the generalization ability of scientific document representation models. We show how state-of-the-art models like SPECTER and SciNCL struggle to generalize across the task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance. We experiment with task-format-specific control codes and adapters and find they outperform the existing single-embedding state-of-the-art by over 2 points absolute. We release the resulting family of multi-format models, called SPECTER2, for the community to use and build on.
    Tighter Bounds on the Expressivity of Transformer Encoders. (arXiv:2301.10743v3 [cs.LG] UPDATED)
    Characterizing neural networks in terms of better-understood formal systems has the potential to yield new insights into the power and limitations of these networks. Doing so for transformers remains an active area of research. Bhattamishra and others have shown that transformer encoders are at least as expressive as a certain kind of counter machine, while Merrill and Sabharwal have shown that fixed-precision transformer encoders recognize only languages in uniform $TC^0$. We connect and strengthen these results by identifying a variant of first-order logic with counting quantifiers that is simultaneously an upper bound for fixed-precision transformer encoders and a lower bound for transformer encoders. This brings us much closer than before to an exact characterization of the languages that transformer encoders recognize.
    On Dynamic Pricing with Covariates. (arXiv:2112.13254v3 [cs.LG] UPDATED)
    We consider dynamic pricing with covariates under a generalized linear demand model: a seller can dynamically adjust the price of a product over a horizon of $T$ time periods, and at each time period $t$, the demand of the product is jointly determined by the price and an observable covariate vector $x_t\in\mathbb{R}^d$ through a generalized linear model with unknown co-efficients. Most of the existing literature assumes the covariate vectors $x_t$'s are independently and identically distributed (i.i.d.); the few papers that relax this assumption either sacrifice model generality or yield sub-optimal regret bounds. In this paper, we show that UCB and Thompson sampling-based pricing algorithms can achieve an $O(d\sqrt{T}\log T)$ regret upper bound without assuming any statistical structure on the covariates $x_t$. Our upper bound on the regret matches the lower bound up to logarithmic factors. We thus show that (i) the i.i.d. assumption is not necessary for obtaining low regret, and (ii) the regret bound can be independent of the (inverse) minimum eigenvalue of the covariance matrix of the $x_t$'s, a quantity present in previous bounds. Moreover, we consider a constrained setting of the dynamic pricing problem where there is a limited and unreplenishable inventory and we develop theoretical results that relate the best achievable algorithm performance to a variation measure with respect to the temporal distribution shift of the covariates. We also discuss conditions under which a better regret is achievable and demonstrate the proposed algorithms' performance with numerical experiments.
    A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity. (arXiv:2305.13169v2 [cs.CL] UPDATED)
    Pretraining is the preliminary and fundamental step in developing capable language models (LM). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 1.5B parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age. A temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. Our findings indicate there does not exist a one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will help support more informed data-centric decisions in LM development.
    Streamlining Energy Transition Scenarios to Key Policy Decisions. (arXiv:2311.06625v1 [cs.LG])
    Uncertainties surrounding the energy transition often lead modelers to present large sets of scenarios that are challenging for policymakers to interpret and act upon. An alternative approach is to define a few qualitative storylines from stakeholder discussions, which can be affected by biases and infeasibilities. Leveraging decision trees, a popular machine-learning technique, we derive interpretable storylines from many quantitative scenarios and show how the key decisions in the energy transition are interlinked. Specifically, our results demonstrate that choosing a high deployment of renewables and sector coupling makes global decarbonization scenarios robust against uncertainties in climate sensitivity and demand. Also, the energy transition to a fossil-free Europe is primarily determined by choices on the roles of bioenergy, storage, and heat electrification. Our transferrable approach translates vast energy model results into a small set of critical decisions, guiding decision-makers in prioritizing the key factors that will shape the energy transition.
    Alternating minimization algorithms for graph regularized tensor completion. (arXiv:2008.12876v2 [math.NA] UPDATED)
    We consider a Canonical Polyadic (CP) decomposition approach to low-rank tensor completion (LRTC) by incorporating external pairwise similarity relations through graph Laplacian regularization on the CP factor matrices. The usage of graph regularization entails benefits in the learning accuracy of LRTC, but at the same time, induces coupling graph Laplacian terms that hinder the optimization of the tensor completion model. In order to solve graph-regularized LRTC, we propose efficient alternating minimization algorithms by leveraging the block structure of the underlying CP decomposition-based model. For the subproblems of alternating minimization, a linear conjugate gradient subroutine is specifically adapted to graph-regularized LRTC. Alternatively, we circumvent the complicating coupling effects of graph Laplacian terms by using an alternating directions method of multipliers. Based on the Kurdyka-{\L}ojasiewicz property, we show that the sequence generated by the proposed algorithms globally converges to a critical point of the objective function. Moreover, the complexity and convergence rate are also derived. In addition, numerical experiments including synthetic data and real data show that the graph regularized tensor completion model has improved recovery results compared to those without graph regularization, and that the proposed algorithms achieve gains in time efficiency over existing algorithms.
    Theory and implementation of inelastic Constitutive Artificial Neural Networks. (arXiv:2311.06380v1 [cs.LG])
    Nature has always been our inspiration in the research, design and development of materials and has driven us to gain a deep understanding of the mechanisms that characterize anisotropy and inelastic behavior. All this knowledge has been accumulated in the principles of thermodynamics. Deduced from these principles, the multiplicative decomposition combined with pseudo potentials are powerful and universal concepts. Simultaneously, the tremendous increase in computational performance enabled us to investigate and rethink our history-dependent material models to make the most of our predictions. Today, we have reached a point where materials and their models are becoming increasingly sophisticated. This raises the question: How do we find the best model that includes all inelastic effects to explain our complex data? Constitutive Artificial Neural Networks (CANN) may answer this question. Here, we extend the CANNs to inelastic materials (iCANN). Rigorous considerations of objectivity, rigid motion of the reference configuration, multiplicative decomposition and its inherent non-uniqueness, restrictions of energy and pseudo potential, and consistent evolution guide us towards the architecture of the iCANN satisfying thermodynamics per design. We combine feed-forward networks of the free energy and pseudo potential with a recurrent neural network approach to take time dependencies into account. We demonstrate that the iCANN is capable of autonomously discovering models for artificially generated data, the response of polymers for cyclic loading and the relaxation behavior of muscle data. As the design of the network is not limited to visco-elasticity, our vision is that the iCANN will reveal to us new ways to find the various inelastic phenomena hidden in the data and to understand their interaction. Our source code, data, and examples are available at doi.org/10.5281/zenodo.10066805
    Simulator-Based Inference with Waldo: Confidence Regions by Leveraging Prediction Algorithms and Posterior Estimators for Inverse Problems. (arXiv:2205.15680v4 [stat.ML] UPDATED)
    Prediction algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method to construct confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led to estimates with prediction bias. We also illustrate how our approach can correct overly confident posterior regions computed with normalizing flows.
    The Challenges of HTR Model Training: Feedback from the Project Donner le gout de l'archive a l'ere numerique. (arXiv:2212.11146v4 [cs.CV] UPDATED)
    The arrival of handwriting recognition technologies offers new possibilities for research in heritage studies. However, it is now necessary to reflect on the experiences and the practices developed by research teams. Our use of the Transkribus platform since 2018 has led us to search for the most significant ways to improve the performance of our handwritten text recognition (HTR) models which are made to transcribe French handwriting dating from the 17th century. This article therefore reports on the impacts of creating transcribing protocols, using the language model at full scale and determining the best way to use base models in order to help increase the performance of HTR models. Combining all of these elements can indeed increase the performance of a single model by more than 20% (reaching a Character Error Rate below 5%). This article also discusses some challenges regarding the collaborative nature of HTR platforms such as Transkribus and the way researchers can share their data generated in the process of creating or training handwritten text recognition models.
    Sparse Attention-Based Neural Networks for Code Classification. (arXiv:2311.06575v1 [cs.PL])
    Categorizing source codes accurately and efficiently is a challenging problem in real-world programming education platform management. In recent years, model-based approaches utilizing abstract syntax trees (ASTs) have been widely applied to code classification tasks. We introduce an approach named the Sparse Attention-based neural network for Code Classification (SACC) in this paper. The approach involves two main steps: In the first step, source code undergoes syntax parsing and preprocessing. The generated abstract syntax tree is split into sequences of subtrees and then encoded using a recursive neural network to obtain a high-dimensional representation. This step simultaneously considers both the logical structure and lexical level information contained within the code. In the second step, the encoded sequences of subtrees are fed into a Transformer model that incorporates sparse attention mechanisms for the purpose of classification. This method efficiently reduces the computational cost of the self-attention mechanisms, thus improving the training speed while preserving effectiveness. Our work introduces a carefully designed sparse attention pattern that is specifically designed to meet the unique needs of code classification tasks. This design helps reduce the influence of redundant information and enhances the overall performance of the model. Finally, we also deal with problems in previous related research, which include issues like incomplete classification labels and a small dataset size. We annotated the CodeNet dataset with algorithm-related labeling categories, which contains a significantly large amount of data. Extensive comparative experimental results demonstrate the effectiveness and efficiency of SACC for the code classification tasks.
    Sparsely Activated Networks. (arXiv:1907.06592v10 [cs.LG] UPDATED)
    Previous literature on unsupervised learning focused on designing structural priors with the aim of learning meaningful features. However, this was done without considering the description length of the learned representations which is a direct and unbiased measure of the model complexity. In this paper, first we introduce the $\varphi$ metric that evaluates unsupervised models based on their reconstruction accuracy and the degree of compression of their internal representations. We then present and define two activation functions (Identity, ReLU) as base of reference and three sparse activation functions (top-k absolutes, Extrema-Pool indices, Extrema) as candidate structures that minimize the previously defined $\varphi$. We lastly present Sparsely Activated Networks (SANs) that consist of kernels with shared weights that, during encoding, are convolved with the input and then passed through a sparse activation function. During decoding, the same weights are convolved with the sparse activation map and subsequently the partial reconstructions from each weight are summed to reconstruct the input. We compare SANs using the five previously defined activation functions on a variety of datasets (Physionet, UCI-epilepsy, MNIST, FMNIST) and show that models that are selected using $\varphi$ have small description representation length and consist of interpretable kernels.
    Sampling weights of deep neural networks. (arXiv:2306.16830v2 [cs.LG] UPDATED)
    We introduce a probability distribution, combined with an efficient sampling algorithm, for weights and biases of fully-connected neural networks. In a supervised learning context, no iterative optimization or gradient computations of internal network parameters are needed to obtain a trained network. The sampling is based on the idea of random feature models. However, instead of a data-agnostic distribution, e.g., a normal distribution, we use both the input and the output training data to sample shallow and deep networks. We prove that sampled networks are universal approximators. For Barron functions, we show that the $L^2$-approximation error of sampled shallow networks decreases with the square root of the number of neurons. Our sampling scheme is invariant to rigid body transformations and scaling of the input data, which implies many popular pre-processing techniques are not required. In numerical experiments, we demonstrate that sampled networks achieve accuracy comparable to iteratively trained ones, but can be constructed orders of magnitude faster. Our test cases involve a classification benchmark from OpenML, sampling of neural operators to represent maps in function spaces, and transfer learning using well-known architectures.
    Privacy Risks Analysis and Mitigation in Federated Learning for Medical Images. (arXiv:2311.06643v1 [cs.LG])
    Federated learning (FL) is gaining increasing popularity in the medical domain for analyzing medical images, which is considered an effective technique to safeguard sensitive patient data and comply with privacy regulations. However, several recent studies have revealed that the default settings of FL may leak private training data under privacy attacks. Thus, it is still unclear whether and to what extent such privacy risks of FL exist in the medical domain, and if so, ``how to mitigate such risks?''. In this paper, first, we propose a holistic framework for Medical data Privacy risk analysis and mitigation in Federated Learning (MedPFL) to analyze privacy risks and develop effective mitigation strategies in FL for protecting private medical data. Second, we demonstrate the substantial privacy risks of using FL to process medical images, where adversaries can easily perform privacy attacks to reconstruct private medical images accurately. Third, we show that the defense approach of adding random noises may not always work effectively to protect medical images against privacy attacks in FL, which poses unique and pressing challenges associated with medical data for privacy protection.
    The Pros and Cons of Using Machine Learning and Interpretable Machine Learning Methods in psychiatry detection applications, specifically depression disorder: A Brief Review. (arXiv:2311.06633v1 [cs.LG])
    The COVID-19 pandemic has forced many people to limit their social activities, which has resulted in a rise in mental illnesses, particularly depression. To diagnose these illnesses with accuracy and speed, and prevent severe outcomes such as suicide, the use of machine learning has become increasingly important. Additionally, to provide precise and understandable diagnoses for better treatment, AI scientists and researchers must develop interpretable AI-based solutions. This article provides an overview of relevant articles in the field of machine learning and interpretable AI, which helps to understand the advantages and disadvantages of using AI in psychiatry disorder detection applications.
    Adversarial Fine-tuning using Generated Respiratory Sound to Address Class Imbalance. (arXiv:2311.06480v1 [cs.SD])
    Deep generative models have emerged as a promising approach in the medical image domain to address data scarcity. However, their use for sequential data like respiratory sounds is less explored. In this work, we propose a straightforward approach to augment imbalanced respiratory sound data using an audio diffusion model as a conditional neural vocoder. We also demonstrate a simple yet effective adversarial fine-tuning method to align features between the synthetic and real respiratory sound samples to improve respiratory sound classification performance. Our experimental results on the ICBHI dataset demonstrate that the proposed adversarial fine-tuning is effective, while only using the conventional augmentation method shows performance degradation. Moreover, our method outperforms the baseline by 2.24% on the ICBHI Score and improves the accuracy of the minority classes up to 26.58%. For the supplementary material, we provide the code at https://github.com/kaen2891/adversarial_fine-tuning_using_generated_respiratory_sound.
    StEik: Stabilizing the Optimization of Neural Signed Distance Functions and Finer Shape Representation. (arXiv:2305.18414v3 [cs.CV] UPDATED)
    We present new insights and a novel paradigm (StEik) for learning implicit neural representations (INR) of shapes. In particular, we shed light on the popular eikonal loss used for imposing a signed distance function constraint in INR. We show analytically that as the representation power of the network increases, the optimization approaches a partial differential equation (PDE) in the continuum limit that is unstable. We show that this instability can manifest in existing network optimization, leading to irregularities in the reconstructed surface and/or convergence to sub-optimal local minima, and thus fails to capture fine geometric and topological structure. We show analytically how other terms added to the loss, currently used in the literature for other purposes, can actually eliminate these instabilities. However, such terms can over-regularize the surface, preventing the representation of fine shape detail. Based on a similar PDE theory for the continuum limit, we introduce a new regularization term that still counteracts the eikonal instability but without over-regularizing. Furthermore, since stability is now guaranteed in the continuum limit, this stabilization also allows for considering new network structures that are able to represent finer shape detail. We introduce such a structure based on quadratic layers. Experiments on multiple benchmark data sets show that our new regularization and network are able to capture more precise shape details and more accurate topology than existing state-of-the-art.
    Edge2Node: Reducing Edge Prediction to Node Classification. (arXiv:2311.02921v2 [cs.LG] UPDATED)
    Despite the success of graph neural network models in node classification, edge prediction (the task of predicting missing or potential links between nodes in a graph) remains a challenging problem for these models. A common approach for edge prediction is to first obtain the embeddings of two nodes, and then a predefined scoring function is used to predict the existence of an edge between the two nodes. In this paper, we introduce a new approach called Edge2Node (E2N) which directly obtains an embedding for each edge, without the need for a scoring function. To do this, we create a new graph H based on the graph G given for the edge prediction task, and then reduce the edge prediction task on G to a node classification task on H. Our E2N method can be easily applied to any edge prediction task with superior performance and lower computational costs. Our E2N method beats the best-known methods on the leaderboards for ogbl-ppa, ogbl-collab, and ogbl-ddi datasets by 25.89%, 24.19%, and 0.34% improvements, respectively.
    Byte Pair Encoding for Symbolic Music. (arXiv:2301.11975v3 [cs.LG] UPDATED)
    When used with deep learning, the symbolic music modality is often coupled with language model architectures. To do so, the music needs to be tokenized, i.e. converted into a sequence of discrete tokens. This can be achieved by different approaches, as music can be composed of simultaneous tracks, of simultaneous notes with several attributes. Until now, the proposed tokenizations rely on small vocabularies of tokens describing the note attributes and time events, resulting in fairly long token sequences, and a sub-optimal use of the embedding space of language models. Recent research has put efforts on reducing the overall sequence length by merging embeddings or combining tokens. In this paper, we show that Byte Pair Encoding, a compression technique widely used for natural language, significantly decreases the sequence length while increasing the vocabulary size. By doing so, we leverage the embedding capabilities of such models with more expressive tokens, resulting in both better results and faster inference in generation and classification tasks. The source code is shared on Github, along with a companion website. Finally, BPE is directly implemented in MidiTok, allowing the reader to easily benefit from this method.
    Welfare and Fairness in Multi-objective Reinforcement Learning. (arXiv:2212.01382v5 [cs.GT] UPDATED)
    We study fair multi-objective reinforcement learning in which an agent must learn a policy that simultaneously achieves high reward on multiple dimensions of a vector-valued reward. Motivated by the fair resource allocation literature, we model this as an expected welfare maximization problem, for some nonlinear fair welfare function of the vector of long-term cumulative rewards. One canonical example of such a function is the Nash Social Welfare, or geometric mean, the log transform of which is also known as the Proportional Fairness objective. We show that even approximately optimal optimization of the expected Nash Social Welfare is computationally intractable even in the tabular case. Nevertheless, we provide a novel adaptation of Q-learning that combines nonlinear scalarized learning updates and non-stationary action selection to learn effective policies for optimizing nonlinear welfare functions. We show that our algorithm is provably convergent, and we demonstrate experimentally that our approach outperforms techniques based on linear scalarization, mixtures of optimal linear scalarizations, or stationary action selection for the Nash Social Welfare Objective.
    Slowly Varying Regression under Sparsity. (arXiv:2102.10773v5 [cs.LG] UPDATED)
    We present the framework of slowly varying regression under sparsity, allowing sparse regression models to exhibit slow and sparse variations. The problem of parameter estimation is formulated as a mixed-integer optimization problem. We demonstrate that it can be precisely reformulated as a binary convex optimization problem through a novel relaxation technique. This relaxation involves a new equality on Moore-Penrose inverses, convexifying the non-convex objective function while matching the original objective on all feasible binary points. This enables us to efficiently solve the problem to provable optimality using a cutting plane-type algorithm. We develop a highly optimized implementation of this algorithm, substantially improving upon the asymptotic computational complexity of a straightforward implementation. Additionally, we propose a fast heuristic method that guarantees a feasible solution and, as empirically illustrated, produces high-quality warm-start solutions for the binary optimization problem. To tune the framework's hyperparameters, we suggest a practical procedure relying on binary search that, under certain assumptions, is guaranteed to recover the true model parameters. On both synthetic and real-world datasets, we demonstrate that the resulting algorithm outperforms competing formulations in comparable times across various metrics, including estimation accuracy, predictive power, and computational time. The algorithm is highly scalable, allowing us to train models with thousands of parameters. Our implementation is available open-source at https://github.com/vvdigalakis/SSVRegression.git.
    DeepSeq: Deep Sequential Circuit Learning. (arXiv:2302.13608v2 [cs.LG] UPDATED)
    Circuit representation learning is a promising research direction in the electronic design automation (EDA) field. With sufficient data for pre-training, the learned general yet effective representation can help to solve multiple downstream EDA tasks by fine-tuning it on a small set of task-related data. However, existing solutions only target combinational circuits, significantly limiting their applications. In this work, we propose DeepSeq, a novel representation learning framework for sequential netlists. Specifically, we introduce a dedicated graph neural network (GNN) with a customized propagation scheme to exploit the temporal correlations between gates in sequential circuits. To ensure effective learning, we propose to use a multi-task training objective with two sets of strongly related supervision: logic probability and transition probability at each node. A novel dual attention aggregation mechanism is introduced to facilitate learning both tasks efficiently. Experimental results on various benchmark circuits show that DeepSeq outperforms other GNN models for sequential circuit learning. We evaluate the generalization capability of DeepSeq on a downstream power estimation task. After fine-tuning, DeepSeq can accurately estimate power across various circuits under different workloads.
    Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion. (arXiv:2311.06318v1 [cs.IR])
    Large Language Models (LLMs) excel at tackling various natural language tasks. However, due to the significant costs involved in re-training or fine-tuning them, they remain largely static and difficult to personalize. Nevertheless, a variety of applications could benefit from generations that are tailored to users' preferences, goals, and knowledge. Among them is web search, where knowing what a user is trying to accomplish, what they care about, and what they know can lead to improved search experiences. In this work, we propose a novel and general approach that augments an LLM with relevant context from users' interaction histories with a search engine in order to personalize its outputs. Specifically, we construct an entity-centric knowledge store for each user based on their search and browsing activities on the web, which is then leveraged to provide contextually relevant LLM prompt augmentations. This knowledge store is light-weight, since it only produces user-specific aggregate projections of interests and knowledge onto public knowledge graphs, and leverages existing search log infrastructure, thereby mitigating the privacy, compliance, and scalability concerns associated with building deep user profiles for personalization. We then validate our approach on the task of contextual query suggestion, which requires understanding not only the user's current search context but also what they historically know and care about. Through a number of experiments based on human evaluation, we show that our approach is significantly better than several other LLM-powered baselines, generating query suggestions that are contextually more relevant, personalized, and useful.
    InstructABSA: Instruction Learning for Aspect Based Sentiment Analysis. (arXiv:2302.08624v6 [cs.CL] UPDATED)
    We introduce InstructABSA, an instruction learning paradigm for Aspect-Based Sentiment Analysis (ABSA) subtasks. Our method introduces positive, negative, and neutral examples to each training sample, and instruction tune the model (Tk-Instruct) for ABSA subtasks, yielding significant performance improvements. Experimental results on the Sem Eval 2014, 15, and 16 datasets demonstrate that InstructABSA outperforms the previous state-of-the-art (SOTA) approaches on Term Extraction (ATE), Sentiment Classification(ATSC) and Sentiment Pair Extraction (ASPE) subtasks. In particular, InstructABSA outperforms the previous state-of-the-art (SOTA) on the Rest14 ATE subtask by 5.69% points, the Rest15 ATSC subtask by 9.59% points, and the Lapt14 AOPE subtask by 3.37% points, surpassing 7x larger models. We also get competitive results on AOOE, AOPE, and AOSTE subtasks indicating strong generalization ability to all subtasks. Exploring sample efficiency reveals that just 50% train data is required to get competitive results with other instruction tuning approaches. Lastly, we assess the quality of instructions and observe that InstructABSA's performance experiences a decline of ~10% when adding misleading examples.
    Compact Matrix Quantum Group Equivariant Neural Networks. (arXiv:2311.06358v1 [cs.LG])
    We derive the existence of a new type of neural network, called a compact matrix quantum group equivariant neural network, that learns from data that has an underlying quantum symmetry. We apply the Woronowicz formulation of Tannaka-Krein duality to characterise the weight matrices that appear in these neural networks for any easy compact matrix quantum group. We show that compact matrix quantum group equivariant neural networks contain, as a subclass, all compact matrix group equivariant neural networks. Moreover, we obtain characterisations of the weight matrices for many compact matrix group equivariant neural networks that have not previously appeared in the machine learning literature.
    Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage. (arXiv:2302.02392v2 [cs.LG] UPDATED)
    In offline reinforcement learning (RL) we have no opportunity to explore so we must make assumptions that the data is sufficient to guide picking a good policy, taking the form of assuming some coverage, realizability, Bellman completeness, and/or hard margin (gap). In this work we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage of just a single comparator policy, and realizability of soft (entropy-regularized) Q-function of the single policy and a related function defined as a saddle point of certain minimax optimization problem. This offers refined and generally more lax conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms to accurately estimate soft or vanilla Q-functions with $L^2$-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
    Reformulating van Rijsbergen's $F_{\beta}$ metric for weighted binary cross-entropy. (arXiv:2210.16458v3 [stat.ML] UPDATED)
    The separation of performance metrics from gradient based loss functions may not always give optimal results and may miss vital aggregate information. This paper investigates incorporating a performance metric alongside differentiable loss functions to inform training outcomes. The goal is to guide model performance and interpretation by assuming statistical distributions on this performance metric for dynamic weighting. The focus is on van Rijsbergens $F_{\beta}$ metric -- a popular choice for gauging classification performance. Through distributional assumptions on the $F_{\beta}$, an intermediary link can be established to the standard binary cross-entropy via dynamic penalty weights. First, the $F_{\beta}$ metric is reformulated to facilitate assuming statistical distributions with accompanying proofs for the cumulative density function. These probabilities are used within a knee curve algorithm to find an optimal $\beta$ or $\beta_{opt}$. This $\beta_{opt}$ is used as a weight or penalty in the proposed weighted binary cross-entropy. Experimentation on publicly available data along with benchmark analysis mostly yields better and interpretable results as compared to the baseline for both imbalanced and balanced classes. For example, for the IMDB text data with known labeling errors, a 14% boost in $F_1$ score is shown. The results also reveal commonalities between the penalty model families derived in this paper and the suitability of recall-centric or precision-centric parameters used in the optimization. The flexibility of this methodology can enhance interpretation.
    From Wide to Deep: Dimension Lifting Network for Parameter-efficient Knowledge Graph Embedding. (arXiv:2303.12816v3 [cs.LG] UPDATED)
    Knowledge graph embedding (KGE) that maps entities and relations into vector representations is essential for downstream applications. Conventional KGE methods require high-dimensional representations to learn the complex structure of knowledge graph, but lead to oversized model parameters. Recent advances reduce parameters by low-dimensional entity representations, while developing techniques (e.g., knowledge distillation or reinvented representation forms) to compensate for reduced dimension. However, such operations introduce complicated computations and model designs that may not benefit large knowledge graphs. To seek a simple strategy to improve the parameter efficiency of conventional KGE models, we take inspiration from that deeper neural networks require exponentially fewer parameters to achieve expressiveness comparable to wider networks for compositional structures. We view all entity representations as a single-layer embedding network, and conventional KGE methods that adopt high-dimensional entity representations equal widening the embedding network to gain expressiveness. To achieve parameter efficiency, we instead propose a deeper embedding network for entity representations, i.e., a narrow entity embedding layer plus a multi-layer dimension lifting network (LiftNet). Experiments on three public datasets show that by integrating LiftNet, four conventional KGE methods with 16-dimensional representations achieve comparable link prediction accuracy as original models that adopt 512-dimensional representations, saving 68.4% to 96.9% parameters.
    Parts of Speech-Grounded Subspaces in Vision-Language Models. (arXiv:2305.14053v2 [cs.CV] UPDATED)
    Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP's joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form while respecting the underlying geometry of the manifold on which the representations lie. What's more, we show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances (e.g. artists' painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists' styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification.
    Multimodal Machine Learning in Image-Based and Clinical Biomedicine: Survey and Prospects. (arXiv:2311.02332v2 [cs.LG] UPDATED)
    Machine learning (ML) applications in medical artificial intelligence (AI) systems have shifted from traditional and statistical methods to increasing application of deep learning models. This survey navigates the current landscape of multimodal ML, focusing on its profound impact on medical image analysis and clinical decision support systems. Emphasizing challenges and innovations in addressing multimodal representation, fusion, translation, alignment, and co-learning, the paper explores the transformative potential of multimodal models for clinical predictions. It also questions practical implementation of such models, bringing attention to the dynamics between decision support systems and healthcare providers. Despite advancements, challenges such as data biases and the scarcity of "big data" in many biomedical domains persist. We conclude with a discussion on effective innovation and collaborative efforts to further the miss
    Learning RL-Policies for Joint Beamforming Without Exploration: A Batch Constrained Off-Policy Approach. (arXiv:2310.08660v2 [cs.LG] UPDATED)
    In this work, we consider the problem of network parameter optimization for rate maximization. We frame this as a joint optimization problem of power control, beam forming, and interference cancellation. We consider the setting where multiple Base Stations (BSs) communicate with multiple user equipment (UEs). Because of the exponential computational complexity of brute force search, we instead solve this nonconvex optimization problem using deep reinforcement learning (RL) techniques. Modern communication systems are notorious for their difficulty in exactly modeling their behavior. This limits us in using RL-based algorithms as interaction with the environment is needed for the agent to explore and learn efficiently. Further, it is ill-advised to deploy the algorithm in the real world for exploration and learning because of the high cost of failure. In contrast to the previous RL-based solutions proposed, such as deep-Q network (DQN) based control, we suggest an offline model-based approach. We specifically consider discrete batch-constrained deep Q-learning (BCQ) and show that performance similar to DQN can be achieved with only a fraction of the data without exploring. This maximizes sample efficiency and minimizes risk in deploying a new algorithm to commercial networks. We provide the entire project resource, including code and data, at the following link: https://github.com/Heasung-Kim/ safe-rl-deployment-for-5g.
    HyperMixer: An MLP-based Low Cost Alternative to Transformers. (arXiv:2203.03691v3 [cs.CL] UPDATED)
    Transformer-based architectures are the model of choice for natural language understanding, but they come at a significant cost, as they have quadratic complexity in the input length, require a lot of training data, and can be difficult to tune. In the pursuit of lower costs, we investigate simple MLP-based architectures. We find that existing architectures such as MLPMixer, which achieves token mixing through a static MLP applied to each feature independently, are too detached from the inductive biases required for natural language understanding. In this paper, we propose a simple variant, HyperMixer, which forms the token mixing MLP dynamically using hypernetworks. Empirically, we demonstrate that our model performs better than alternative MLP-based models, and on par with Transformers. In contrast to Transformers, HyperMixer achieves these results at substantially lower costs in terms of processing time, training data, and hyperparameter tuning.
    ChatGPT Prompting Cannot Estimate Predictive Uncertainty in High-Resource Languages. (arXiv:2311.06427v1 [cs.CL])
    ChatGPT took the world by storm for its impressive abilities. Due to its release without documentation, scientists immediately attempted to identify its limits, mainly through its performance in natural language processing (NLP) tasks. This paper aims to join the growing literature regarding ChatGPT's abilities by focusing on its performance in high-resource languages and on its capacity to predict its answers' accuracy by giving a confidence level. The analysis of high-resource languages is of interest as studies have shown that low-resource languages perform worse than English in NLP tasks, but no study so far has analysed whether high-resource languages perform as well as English. The analysis of ChatGPT's confidence calibration has not been carried out before either and is critical to learn about ChatGPT's trustworthiness. In order to study these two aspects, five high-resource languages and two NLP tasks were chosen. ChatGPT was asked to perform both tasks in the five languages and to give a numerical confidence value for each answer. The results show that all the selected high-resource languages perform similarly and that ChatGPT does not have a good confidence calibration, often being overconfident and never giving low confidence values.
    Asymmetric Contrastive Multimodal Learning for Advancing Chemical Understanding. (arXiv:2311.06456v1 [cs.LG])
    The versatility of multimodal deep learning holds tremendous promise for advancing scientific research and practical applications. As this field continues to evolve, the collective power of cross-modal analysis promises to drive transformative innovations, leading us to new frontiers in chemical understanding and discovery. Hence, we introduce Asymmetric Contrastive M}ultimodal Learning (ACML) as a novel approach tailored for molecules, showcasing its potential to advance the field of chemistry. ACML harnesses the power of effective asymmetric contrastive learning to seamlessly transfer information from various chemical modalities to molecular graph representations. By combining pre-trained chemical unimodal encoders and a shallow-designed graph encoder, ACML facilitates the assimilation of coordinated chemical semantics from different modalities, leading to comprehensive representation learning with efficient training. This innovative framework enhances the interpretability of learned representations and bolsters the expressive power of graph neural networks. Through practical tasks such as isomer discrimination and uncovering crucial chemical properties for drug discovery, ACML exhibits its capability to revolutionize chemical research and applications, providing a deeper understanding of chemical semantics of different modalities.
    ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models. (arXiv:2310.09624v2 [cs.CL] UPDATED)
    As large language models are integrated into society, robustness toward a suite of prompts is increasingly important to maintain reliability in a high-variance environment.Robustness evaluations must comprehensively encapsulate the various settings in which a user may invoke an intelligent system. This paper proposes ASSERT, Automated Safety Scenario Red Teaming, consisting of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection. For robust safety evaluation, we apply these methods in the critical domain of AI safety to algorithmically generate a test suite of prompts covering diverse robustness settings -- semantic equivalence, related scenarios, and adversarial. We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance. Despite dedicated safeguards in existing state-of-the-art models, we find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings, raising concerns for users' physical safety.
    AutoAlign: Fully Automatic and Effective Knowledge Graph Alignment enabled by Large Language Models. (arXiv:2307.11772v3 [cs.IR] UPDATED)
    The task of entity alignment between knowledge graphs (KGs) aims to identify every pair of entities from two different KGs that represent the same entity. Many machine learning-based methods have been proposed for this task. However, to our best knowledge, existing methods all require manually crafted seed alignments, which are expensive to obtain. In this paper, we propose the first fully automatic alignment method named AutoAlign, which does not require any manually crafted seed alignments. Specifically, for predicate embeddings, AutoAlign constructs a predicate-proximity-graph with the help of large language models to automatically capture the similarity between predicates across two KGs. For entity embeddings, AutoAlign first computes the entity embeddings of each KG independently using TransE, and then shifts the two KGs' entity embeddings into the same vector space by computing the similarity between entities based on their attributes. Thus, both predicate alignment and entity alignment can be done without manually crafted seed alignments. AutoAlign is not only fully automatic, but also highly effective. Experiments using real-world KGs show that AutoAlign improves the performance of entity alignment significantly compared to state-of-the-art methods.
    Brownian Noise Reduction: Maximizing Privacy Subject to Accuracy Constraints. (arXiv:2206.07234v4 [cs.LG] UPDATED)
    There is a disconnect between how researchers and practitioners handle privacy-utility tradeoffs. Researchers primarily operate from a privacy first perspective, setting strict privacy requirements and minimizing risk subject to these constraints. Practitioners often desire an accuracy first perspective, possibly satisfied with the greatest privacy they can get subject to obtaining sufficiently small error. Ligett et al. have introduced a "noise reduction" algorithm to address the latter perspective. The authors show that by adding correlated Laplace noise and progressively reducing it on demand, it is possible to produce a sequence of increasingly accurate estimates of a private parameter while only paying a privacy cost for the least noisy iterate released. In this work, we generalize noise reduction to the setting of Gaussian noise, introducing the Brownian mechanism. The Brownian mechanism works by first adding Gaussian noise of high variance corresponding to the final point of a simulated Brownian motion. Then, at the practitioner's discretion, noise is gradually decreased by tracing back along the Brownian path to an earlier time. Our mechanism is more naturally applicable to the common setting of bounded $\ell_2$-sensitivity, empirically outperforms existing work on common statistical tasks, and provides customizable control of privacy loss over the entire interaction with the practitioner. We complement our Brownian mechanism with ReducedAboveThreshold, a generalization of the classical AboveThreshold algorithm that provides adaptive privacy guarantees. Overall, our results demonstrate that one can meet utility constraints while still maintaining strong levels of privacy.
    Learning nonparametric ordinary differential equations from noisy data. (arXiv:2206.15215v3 [stat.ML] UPDATED)
    Learning nonparametric systems of Ordinary Differential Equations (ODEs) dot x = f(t,x) from noisy data is an emerging machine learning topic. We use the well-developed theory of Reproducing Kernel Hilbert Spaces (RKHS) to define candidates for f for which the solution of the ODE exists and is unique. Learning f consists of solving a constrained optimization problem in an RKHS. We propose a penalty method that iteratively uses the Representer theorem and Euler approximations to provide a numerical solution. We prove a generalization bound for the L2 distance between x and its estimator and provide experimental comparisons with the state-of-the-art.
    Similarity-based cooperative equilibrium. (arXiv:2211.14468v2 [cs.GT] UPDATED)
    As machine learning agents act more autonomously in the world, they will increasingly interact with each other. Unfortunately, in many social dilemmas like the one-shot Prisoner's Dilemma, standard game theory predicts that ML agents will fail to cooperate with each other. Prior work has shown that one way to enable cooperative outcomes in the one-shot Prisoner's Dilemma is to make the agents mutually transparent to each other, i.e., to allow them to access one another's source code (Rubinstein 1998, Tennenholtz 2004) -- or weights in the case of ML agents. However, full transparency is often unrealistic, whereas partial transparency is commonplace. Moreover, it is challenging for agents to learn their way to cooperation in the full transparency setting. In this paper, we introduce a more realistic setting in which agents only observe a single number indicating how similar they are to each other. We prove that this allows for the same set of cooperative outcomes as the full transparency setting. We also demonstrate experimentally that cooperation can be learned using simple ML methods.
    Bias-inducing geometries: an exactly solvable data model with fairness implications. (arXiv:2205.15935v3 [cs.LG] UPDATED)
    Machine learning (ML) may be oblivious to human bias but it is not immune to its perpetuation. Marginalisation and iniquitous group representation are often traceable in the very data used for training, and may be reflected or even enhanced by the learning models. In the present work, we aim at clarifying the role played by data geometry in the emergence of ML bias. We introduce an exactly solvable high-dimensional model of data imbalance, where parametric control over the many bias-inducing factors allows for an extensive exploration of the bias inheritance mechanism. Through the tools of statistical physics, we analytically characterise the typical properties of learning models trained in this synthetic framework and obtain exact predictions for the observables that are commonly employed for fairness assessment. Despite the simplicity of the data model, we retrace and unpack typical unfairness behaviour observed on real-world datasets. We also obtain a detailed analytical characterisation of a class of bias mitigation strategies. We first consider a basic loss-reweighing scheme, which allows for an implicit minimisation of different unfairness metrics, and quantify the incompatibilities between some existing fairness criteria. Then, we consider a novel mitigation strategy based on a matched inference approach, consisting in the introduction of coupled learning models. Our theoretical analysis of this approach shows that the coupled strategy can strike superior fairness-accuracy trade-offs.
    Learning and DiSentangling Patient Static Information from Time-series Electronic HEalth Record (STEER). (arXiv:2309.11373v2 [cs.LG] UPDATED)
    Recent work in machine learning for healthcare has raised concerns about patient privacy and algorithmic fairness. For example, previous work has shown that patient self-reported race can be predicted from medical data that does not explicitly contain racial information. However, the extent of data identification is unknown, and we lack ways to develop models whose outcomes are minimally affected by such information. Here we systematically investigated the ability of time-series electronic health record data to predict patient static information. We found that not only the raw time-series data, but also learned representations from machine learning models, can be trained to predict a variety of static information with area under the receiver operating characteristic curve as high as 0.851 for biological sex, 0.869 for binarized age and 0.810 for self-reported race. Such high predictive performance can be extended to a wide range of comorbidity factors and exists even when the model was trained for different tasks, using different cohorts, using different model architectures and databases. Given the privacy and fairness concerns these findings pose, we develop a variational autoencoder-based approach that learns a structured latent space to disentangle patient-sensitive attributes from time-series data. Our work thoroughly investigates the ability of machine learning models to encode patient static information from time-series electronic health records and introduces a general approach to protect patient-sensitive attribute information for downstream tasks.
    Bigger, Better, Faster: Human-level Atari with human-level efficiency. (arXiv:2305.19452v3 [cs.LG] UPDATED)
    We introduce a value-based RL agent, which we call BBF, that achieves super-human performance in the Atari 100K benchmark. BBF relies on scaling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about updating the goalposts for sample-efficient RL research on the ALE. We make our code and data publicly available at https://github.com/google-research/google-research/tree/master/bigger_better_faster.  ( 2 min )
    CompCodeVet: A Compiler-guided Validation and Enhancement Approach for Code Dataset. (arXiv:2311.06505v1 [cs.LG])
    Large language models (LLMs) have become increasingly prominent in academia and industry due to their remarkable performance in diverse applications. As these models evolve with increasing parameters, they excel in tasks like sentiment analysis and machine translation. However, even models with billions of parameters face challenges in tasks demanding multi-step reasoning. Code generation and comprehension, especially in C and C++, emerge as significant challenges. While LLMs trained on code datasets demonstrate competence in many tasks, they struggle with rectifying non-compilable C and C++ code. Our investigation attributes this subpar performance to two primary factors: the quality of the training dataset and the inherent complexity of the problem which demands intricate reasoning. Existing "Chain of Thought" (CoT) prompting techniques aim to enhance multi-step reasoning. This approach, however, retains the limitations associated with the latent drawbacks of LLMs. In this work, we propose CompCodeVet, a compiler-guided CoT approach to produce compilable code from non-compilable ones. Diverging from the conventional approach of utilizing larger LLMs, we employ compilers as a teacher to establish a more robust zero-shot thought process. The evaluation of CompCodeVet on two open-source code datasets shows that CompCodeVet has the ability to improve the training dataset quality for LLMs.
    A 3D Conditional Diffusion Model for Image Quality Transfer -- An Application to Low-Field MRI. (arXiv:2311.06631v1 [eess.IV])
    Low-field (LF) MRI scanners (<1T) are still prevalent in settings with limited resources or unreliable power supply. However, they often yield images with lower spatial resolution and contrast than high-field (HF) scanners. This quality disparity can result in inaccurate clinician interpretations. Image Quality Transfer (IQT) has been developed to enhance the quality of images by learning a mapping function between low and high-quality images. Existing IQT models often fail to restore high-frequency features, leading to blurry output. In this paper, we propose a 3D conditional diffusion model to improve 3D volumetric data, specifically LF MR images. Additionally, we incorporate a cross-batch mechanism into the self-attention and padding of our network, ensuring broader contextual awareness even under small 3D patches. Experiments on the publicly available Human Connectome Project (HCP) dataset for IQT and brain parcellation demonstrate that our model outperforms existing methods both quantitatively and qualitatively. The code is publicly available at \url{https://github.com/edshkim98/DiffusionIQT}.
    Generative Learning of Heterogeneous Tail Dependence. (arXiv:2011.13132v2 [cs.LG] UPDATED)
    We propose a multivariate generative model to capture the complex dependence structure often encountered in business and financial data. Our model features heterogeneous and asymmetric tail dependence between all pairs of individual dimensions while also allowing heterogeneity and asymmetry in the tails of the marginals. A significant merit of our model structure is that it is not prone to error propagation in the parameter estimation process, hence very scalable, as the dimensions of datasets grow large. However, the likelihood methods are infeasible for parameter estimation in our case due to the lack of a closed-form density function. Instead, we devise a novel moment learning algorithm to learn the parameters. To demonstrate the effectiveness of the model and its estimator, we test them on simulated as well as real-world datasets. Results show that this framework gives better finite-sample performance compared to the copula-based benchmarks as well as recent similar models.
    A Saliency-based Clustering Framework for Identifying Aberrant Predictions. (arXiv:2311.06454v1 [cs.LG])
    In machine learning, classification tasks serve as the cornerstone of a wide range of real-world applications. Reliable, trustworthy classification is particularly intricate in biomedical settings, where the ground truth is often inherently uncertain and relies on high degrees of human expertise for labeling. Traditional metrics such as precision and recall, while valuable, are insufficient for capturing the nuances of these ambiguous scenarios. Here we introduce the concept of aberrant predictions, emphasizing that the nature of classification errors is as critical as their frequency. We propose a novel, efficient training methodology aimed at both reducing the misclassification rate and discerning aberrant predictions. Our framework demonstrates a substantial improvement in model performance, achieving a 20\% increase in precision. We apply this methodology to the less-explored domain of veterinary radiology, where the stakes are high but have not been as extensively studied compared to human medicine. By focusing on the identification and mitigation of aberrant predictions, we enhance the utility and trustworthiness of machine learning classifiers in high-stakes, real-world scenarios, including new applications in the veterinary world.
    Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models. (arXiv:2311.06322v1 [cs.CV])
    Diffusion models have achieved great success due to their remarkable generation ability. However, their high computational overhead is still a troublesome problem. Recent studies have leveraged post-training quantization (PTQ) to compress diffusion models. However, most of them only focus on unconditional models, leaving the quantization of widely used large pretrained text-to-image models, e.g., Stable Diffusion, largely unexplored. In this paper, we propose a novel post-training quantization method PCR (Progressive Calibration and Relaxing) for text-to-image diffusion models, which consists of a progressive calibration strategy that considers the accumulated quantization error across timesteps, and an activation relaxing strategy that improves the performance with negligible cost. Additionally, we demonstrate the previous metrics for text-to-image diffusion model quantization are not accurate due to the distribution gap. To tackle the problem, we propose a novel QDiffBench benchmark, which utilizes data in the same domain for more accurate evaluation. Besides, QDiffBench also considers the generalization performance of the quantized model outside the calibration dataset. Extensive experiments on Stable Diffusion and Stable Diffusion XL demonstrate the superiority of our method and benchmark. Moreover, we are the first to achieve quantization for Stable Diffusion XL while maintaining the performance.
    Online Continual Learning via Logit Adjusted Softmax. (arXiv:2311.06460v1 [cs.LG])
    Online continual learning is a challenging problem where models must learn from a non-stationary data stream while avoiding catastrophic forgetting. Inter-class imbalance during training has been identified as a major cause of forgetting, leading to model prediction bias towards recently learned classes. In this paper, we theoretically analyze that inter-class imbalance is entirely attributed to imbalanced class-priors, and the function learned from intra-class intrinsic distributions is the Bayes-optimal classifier. To that end, we present that a simple adjustment of model logits during training can effectively resist prior class bias and pursue the corresponding Bayes-optimum. Our proposed method, Logit Adjusted Softmax, can mitigate the impact of inter-class imbalance not only in class-incremental but also in realistic general setups, with little additional computational cost. We evaluate our approach on various benchmarks and demonstrate significant performance improvements compared to prior arts. For example, our approach improves the best baseline by 4.6% on CIFAR10.
    Quantum Neural Networks for Power Flow Analysis. (arXiv:2311.06293v1 [quant-ph])
    This paper explores the potential application of quantum and hybrid quantum-classical neural networks in power flow analysis. Experiments are conducted using two small-size datasets based on the IEEE 4-bus and 33-bus test systems. A systematic performance comparison is also conducted among quantum, hybrid quantum-classical, and classical neural networks. The comparison is based on (i) generalization ability, (ii) robustness, (iii) training dataset size needed, (iv) training error. (v) training computational time, and (vi) training process stability. The results show that the developed quantum-classical neural network outperforms both quantum and classical neural networks, and hence can improve deep learning-based power flow analysis in the noisy-intermediate-scale quantum (NISQ) era.
    Computer Vision for Particle Size Analysis of Coarse-Grained Soils. (arXiv:2311.06613v1 [cs.CV])
    Particle size analysis (PSA) is a fundamental technique for evaluating the physical characteristics of soils. However, traditional methods like sieving can be time-consuming and labor-intensive. In this study, we present a novel approach that utilizes computer vision (CV) and the Python programming language for PSA of coarse-grained soils, employing a standard mobile phone camera. By eliminating the need for a high-performance camera, our method offers convenience and cost savings. Our methodology involves using the OPENCV library to detect and measure soil particles in digital photographs taken under ordinary lighting conditions. For accurate particle size determination, a calibration target with known dimensions is placed on a plain paper alongside 20 different sand samples. The proposed method is compared with traditional sieve analysis and exhibits satisfactory performance for soil particles larger than 2 mm, with a mean absolute percent error (MAPE) of approximately 6%. However, particles smaller than 2 mm result in higher MAPE, reaching up to 60%. To address this limitation, we recommend using a higher-resolution camera to capture images of the smaller soil particles. Furthermore, we discuss the advantages, limitations, and potential future improvements of our method. Remarkably, the program can be executed on a mobile phone, providing immediate results without the need to send soil samples to a laboratory. This field-friendly feature makes our approach highly convenient for on-site usage, outside of a traditional laboratory setting. Ultimately, this novel method represents an initial disruption to the industry, enabling efficient particle size analysis of soil without the reliance on laboratory-based sieve analysis. KEYWORDS: Computer vision, Grain size, ARUCO
    i-Razor: A Differentiable Neural Input Razor for Feature Selection and Dimension Search in DNN-Based Recommender Systems. (arXiv:2204.00281v2 [cs.IR] UPDATED)
    Input features play a crucial role in DNN-based recommender systems with thousands of categorical and continuous fields from users, items, contexts, and interactions. Noisy features and inappropriate embedding dimension assignments can deteriorate the performance of recommender systems and introduce unnecessary complexity in model training and online serving. Optimizing the input configuration of DNN models, including feature selection and embedding dimension assignment, has become one of the essential topics in feature engineering. However, in existing industrial practices, feature selection and dimension search are optimized sequentially, i.e., feature selection is performed first, followed by dimension search to determine the optimal dimension size for each selected feature. Such a sequential optimization mechanism increases training costs and risks generating suboptimal input configurations. To address this problem, we propose a differentiable neural input razor (i-Razor) that enables joint optimization of feature selection and dimension search. Concretely, we introduce an end-to-end differentiable model to learn the relative importance of different embedding regions of each feature. Furthermore, a flexible pruning algorithm is proposed to achieve feature filtering and dimension derivation simultaneously. Extensive experiments on two large-scale public datasets in the Click-Through-Rate (CTR) prediction task demonstrate the efficacy and superiority of i-Razor in balancing model complexity and performance.
    Understanding Grokking Through A Robustness Viewpoint. (arXiv:2311.06597v1 [cs.LG])
    Recently, an unusual phenomenon called grokking has gained much attention, where sometimes a neural network generalizes long after it perfectly fits the training data. We try to understand this seemingly strange phenomenon using the robustness of the neural network. Using a robustness viewpoint, we show that the popular $l_2$ weight norm (metric) of the neural network is actually a sufficient condition for grokking. As we also empirically find that $l_2$ norm correlates with grokking on the test data not in a timely way, we propose new metrics based on robustness and information theory and find that our new metrics correlate well with the grokking phenomenon. Based on the previous observations, we propose methods to speed up the generalization process. In addition, we examine the standard training process on modulo addition dataset and find that it hardly learns other basic group operations before grokking, including the commutative law. Interestingly, the speed up of generalization when using our proposed method can be partially explained by learning the commutative law, a necessary condition when the model groks on test dataset.
    Forte: An Interactive Visual Analytic Tool for Trust-Augmented Net Load Forecasting. (arXiv:2311.06413v1 [cs.HC])
    Accurate net load forecasting is vital for energy planning, aiding decisions on trade and load distribution. However, assessing the performance of forecasting models across diverse input variables, like temperature and humidity, remains challenging, particularly for eliciting a high degree of trust in the model outcomes. In this context, there is a growing need for data-driven technological interventions to aid scientists in comprehending how models react to both noisy and clean input variables, thus shedding light on complex behaviors and fostering confidence in the outcomes. In this paper, we present Forte, a visual analytics-based application to explore deep probabilistic net load forecasting models across various input variables and understand the error rates for different scenarios. With carefully designed visual interventions, this web-based interface empowers scientists to derive insights about model performance by simulating diverse scenarios, facilitating an informed decision-making process. We discuss observations made using Forte and demonstrate the effectiveness of visualization techniques to provide valuable insights into the correlation between weather inputs and net load forecasts, ultimately advancing grid capabilities by improving trust in forecasting models.
    Comprehensive Comparison of Deep Learning Models for Lung and COVID-19 Lesion Segmentation in CT scans. (arXiv:2009.06412v7 [eess.IV] UPDATED)
    Recently there has been an explosion in the use of Deep Learning (DL) methods for medical image segmentation. However the field's reliability is hindered by the lack of a common base of reference for accuracy/performance evaluation and the fact that previous research uses different datasets for evaluation. In this paper, an extensive comparison of DL models for lung and COVID-19 lesion segmentation in Computerized Tomography (CT) scans is presented, which can also be used as a benchmark for testing medical image segmentation models. Four DL architectures (Unet, Linknet, FPN, PSPNet) are combined with 25 randomly initialized and pretrained encoders (variations of VGG, DenseNet, ResNet, ResNext, DPN, MobileNet, Xception, Inception-v4, EfficientNet), to construct 200 tested models. Three experimental setups are conducted for lung segmentation, lesion segmentation and lesion segmentation using the original lung masks. A public COVID-19 dataset with 100 CT scan images (80 for train, 20 for validation) is used for training/validation and a different public dataset consisting of 829 images from 9 CT scan volumes for testing. Multiple findings are provided including the best architecture-encoder models for each experiment as well as mean Dice results for each experiment, architecture and encoder independently. Finally, the upper bounds improvements when using lung masks as a preprocessing step or when using pretrained models are quantified. The source code and 600 pretrained models for the three experiments are provided, suitable for fine-tuning in experimental setups without GPU capabilities.  ( 3 min )
    OpenNDD: Open Set Recognition for Neurodevelopmental Disorders Detection. (arXiv:2306.16045v2 [cs.CV] UPDATED)
    Since the strong comorbid similarity in NDDs, such as attention-deficit hyperactivity disorder, can interfere with the accurate diagnosis of autism spectrum disorder (ASD), identifying unknown classes is extremely crucial and challenging from NDDs. We design a novel open set recognition framework for ASD-aided diagnosis (OpenNDD), which trains a model by combining autoencoder and adversarial reciprocal points learning to distinguish in-distribution and out-of-distribution categories as well as identify ASD accurately. Considering the strong similarities between NDDs, we present a joint scaling method by Min-Max scaling combined with Standardization (MMS) to increase the differences between classes for better distinguishing unknown NDDs. We conduct the experiments in the hybrid datasets from Autism Brain Imaging Data Exchange I (ABIDE I) and THE ADHD-200 SAMPLE (ADHD-200) with 791 samples from four sites and the results demonstrate the superiority on various metrics. Our OpenNDD achieves promising performance, where the accuracy is 77.38%, AUROC is 75.53% and the open set classification rate is as high as 59.43%.  ( 2 min )
    The AeroSonicDB (YPAD-0523) Dataset for Acoustic Detection and Classification of Aircraft. (arXiv:2311.06368v1 [cs.SD])
    The time and expense required to collect and label audio data has been a prohibitive factor in the availability of domain specific audio datasets. As the predictive specificity of a classifier depends on the specificity of the labels it is trained on, it follows that finely-labelled datasets are crucial for advances in machine learning. Aiming to stimulate progress in the field of machine listening, this paper introduces AeroSonicDB (YPAD-0523), a dataset of low-flying aircraft sounds for training acoustic detection and classification systems. This paper describes the method of exploiting ADS-B radio transmissions to passively collect and label audio samples. Provides a summary of the collated dataset. Presents baseline results from three binary classification models, then discusses the limitations of the current dataset and its future potential. The dataset contains 625 aircraft recordings ranging in event duration from 18 to 60 seconds, for a total of 8.87 hours of aircraft audio. These 625 samples feature 301 unique aircraft, each of which are supplied with 14 supplementary (non-acoustic) labels to describe the aircraft. The dataset also contains 3.52 hours of ambient background audio ("silence"), as a means to distinguish aircraft noise from other local environmental noises. Additionally, 6 hours of urban soundscape recordings (with aircraft annotations) are included as an ancillary method for evaluating model performance, and to provide a testing ground for real-time applications.
    The Exact Determinant of a Specific Class of Sparse Positive Definite Matrices. (arXiv:2311.06632v1 [stat.ML])
    For a specific class of sparse Gaussian graphical models, we provide a closed-form solution for the determinant of the covariance matrix. In our framework, the graphical interaction model (i.e., the covariance selection model) is equal to replacement product of $\mathcal{K}_{n}$ and $\mathcal{K}_{n-1}$, where $\mathcal{K}_n$ is the complete graph with $n$ vertices. Our analysis is based on taking the Fourier transform of the local factors of the model, which can be viewed as an application of the Normal Factor Graph Duality Theorem and holographic algorithms. The closed-form expression is obtained by applying the Matrix Determinant Lemma on the transformed graphical model. In this context, we will also define a notion of equivalence between two Gaussian graphical models.  ( 2 min )
    Community Recovery in the Geometric Block Model. (arXiv:2206.11303v2 [cs.SI] UPDATED)
    To capture inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a \textit{Geometric Block Model}. The geometric block model builds on the \emph{random geometric graphs} (Gilbert, 1961), one of the basic models of random graphs for spatial networks, in the same way that the well-studied stochastic block model builds on the Erd\H{o}s-R\'{en}yi random graphs. It is also a natural extension of random community models inspired by the recent theoretical and practical advancements in community detection. To analyze the geometric block model, we first provide new connectivity results for \emph{random annulus graphs} which are generalizations of random geometric graphs. The connectivity properties of geometric graphs have been studied since their introduction, and analyzing them has been difficult due to correlated edge formation. We then use the connectivity results of random annulus graphs to provide necessary and sufficient conditions for efficient recovery of communities for the geometric block model. We show that a simple triangle-counting algorithm to detect communities in the geometric block model is near-optimal. For this we consider two regimes of graph density. In the regime where the average degree of the graph grows logarithmically with number of vertices, we show that our algorithm performs extremely well, both theoretically and practically. In contrast, the triangle-counting algorithm is far from being optimum for the stochastic block model in the logarithmic degree regime. We also look at the regime where the average degree of the graph grows linearly with the number of vertices $n$, and hence to store the graph one needs $\Theta(n^2)$ memory. We show that our algorithm needs to store only $O(n \log n)$ edges in this regime to recover the latent communities.  ( 3 min )
    Instance-Dependent Generalization Bounds via Optimal Transport. (arXiv:2211.01258v4 [stat.ML] UPDATED)
    Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks. Since such bounds often hold uniformly over all parameters, they suffer from over-parametrization and fail to account for the strong inductive bias of initialization and stochastic gradient descent. As an alternative, we propose a novel optimal transport interpretation of the generalization problem. This allows us to derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. Therefore, our bounds are agnostic to the parametrization of the model and work well when the number of training samples is much smaller than the number of parameters. With small modifications, our approach yields accelerated rates for data on low-dimensional manifolds and guarantees under distribution shifts. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.  ( 2 min )
    Provably Convergent Plug-and-Play Quasi-Newton Methods. (arXiv:2303.07271v4 [math.OC] UPDATED)
    Plug-and-Play (PnP) methods are a class of efficient iterative methods that aim to combine data fidelity terms and deep denoisers using classical optimization algorithms, such as ISTA or ADMM, with applications in inverse problems and imaging. Provable PnP methods are a subclass of PnP methods with convergence guarantees, such as fixed point convergence or convergence to critical points of some energy function. Many existing provable PnP methods impose heavy restrictions on the denoiser or fidelity function, such as non-expansiveness or strict convexity, respectively. In this work, we propose a novel algorithmic approach incorporating quasi-Newton steps into a provable PnP framework based on proximal denoisers, resulting in greatly accelerated convergence while retaining light assumptions on the denoiser. By characterizing the denoiser as the proximal operator of a weakly convex function, we show that the fixed points of the proposed quasi-Newton PnP algorithm are critical points of a weakly convex function. Numerical experiments on image deblurring and super-resolution demonstrate 2--8x faster convergence as compared to other provable PnP methods with similar reconstruction quality.  ( 2 min )
    milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing. (arXiv:2306.17010v3 [cs.CV] UPDATED)
    Approaching the era of ubiquitous computing, human motion sensing plays a crucial role in smart systems for decision making, user interaction, and personalized services. Extensive research has been conducted on human tracking, pose estimation, gesture recognition, and activity recognition, which are predominantly based on cameras in traditional methods. However, the intrusive nature of cameras limits their use in smart home applications. To address this, mmWave radars have gained popularity due to their privacy-friendly features. In this work, we propose milliFlow, a novel deep learning method for scene flow estimation as a complementary motion information for mmWave point cloud, serving as an intermediate level of features and directly benefiting downstream human motion sensing tasks. Experimental results demonstrate the superior performance of our method with an average 3D endpoint error of 4.6cm, significantly surpassing the competing approaches. Furthermore, by incorporating scene flow information, we achieve remarkable improvements in human activity recognition, human parsing, and human body part tracking. To foster further research in this area, we will provide our codebase and dataset for open access upon acceptance.  ( 2 min )
    FedBug: A Bottom-Up Gradual Unfreezing Framework for Federated Learning. (arXiv:2307.10317v2 [cs.LG] UPDATED)
    Federated Learning (FL) offers a collaborative training framework, allowing multiple clients to contribute to a shared model without compromising data privacy. Due to the heterogeneous nature of local datasets, updated client models may overfit and diverge from one another, commonly known as the problem of client drift. In this paper, we propose FedBug (Federated Learning with Bottom-Up Gradual Unfreezing), a novel FL framework designed to effectively mitigate client drift. FedBug adaptively leverages the client model parameters, distributed by the server at each global round, as the reference points for cross-client alignment. Specifically, on the client side, FedBug begins by freezing the entire model, then gradually unfreezes the layers, from the input layer to the output layer. This bottom-up approach allows models to train the newly thawed layers to project data into a latent space, wherein the separating hyperplanes remain consistent across all clients. We theoretically analyze FedBug in a novel over-parameterization FL setup, revealing its superior convergence rate compared to FedAvg. Through comprehensive experiments, spanning various datasets, training conditions, and network architectures, we validate the efficacy of FedBug. Our contributions encompass a novel FL framework, theoretical analysis, and empirical validation, demonstrating the wide potential and applicability of FedBug.  ( 2 min )
    BB-ML: Basic Block Performance Prediction using Machine Learning Techniques. (arXiv:2202.07798v3 [cs.LG] UPDATED)
    Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the Basic Block (BB) level, which are single entry, single exit code blocks that are used for analysis by the compilers to break down a large code into manageable pieces. We extrapolate the basic block execution counts of GPU applications and use them for predicting the performance for large input sizes from the counts of smaller input sizes. We train a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieve an accuracy of 93.5% in extrapolating the basic block counts for large input sets when trained on smaller input sets and an accuracy of 97.7% in predicting basic block counts on random instances. In a case study, we apply the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications. We use a variety of metrics for evaluation, including global memory requests and the active cycles of tensor cores, ALU, and FMA units. Results demonstrate the model's capability of predicting the performance of large datasets with an average error rate of 0.85% and 0.17% for global and shared memory requests, respectively. Additionally, to address the utilization of the main functional units in Ampere architecture GPUs, we calculate the active cycles for tensor cores, ALU, FMA, and FP64 units and achieve an average error of 2.3% and 10.66% for ALU and FMA units while the maximum observed error across all tested applications and units reaches 18.5%.  ( 3 min )
    A comprehensive analysis of concept drift locality in data streams. (arXiv:2311.06396v1 [cs.LG])
    Adapting to drifting data streams is a significant challenge in online learning. Concept drift must be detected for effective model adaptation to evolving data properties. Concept drift can impact the data distribution entirely or partially, which makes it difficult for drift detectors to accurately identify the concept drift. Despite the numerous concept drift detectors in the literature, standardized procedures and benchmarks for comprehensive evaluation considering the locality of the drift are lacking. We present a novel categorization of concept drift based on its locality and scale. A systematic approach leads to a set of 2,760 benchmark problems, reflecting various difficulty levels following our proposed categorization. We conduct a comparative assessment of 9 state-of-the-art drift detectors across diverse difficulties, highlighting their strengths and weaknesses for future research. We examine how drift locality influences the classifier performance and propose strategies for different drift categories to minimize the recovery time. Lastly, we provide lessons learned and recommendations for future concept drift research. Our benchmark data streams and experiments are publicly available at https://github.com/gabrieljaguiar/locality-concept-drift.  ( 2 min )
    RankSEG: A Consistent Ranking-based Framework for Segmentation. (arXiv:2206.13086v3 [stat.ML] UPDATED)
    Segmentation has emerged as a fundamental field of computer vision and natural language processing, which assigns a label to every pixel/feature to extract regions of interest from an image/text. To evaluate the performance of segmentation, the Dice and IoU metrics are used to measure the degree of overlap between the ground truth and the predicted segmentation. In this paper, we establish a theoretical foundation of segmentation with respect to the Dice/IoU metrics, including the Bayes rule and Dice-/IoU-calibration, analogous to classification-calibration or Fisher consistency in classification. We prove that the existing thresholding-based framework with most operating losses are not consistent with respect to the Dice/IoU metrics, and thus may lead to a suboptimal solution. To address this pitfall, we propose a novel consistent ranking-based framework, namely RankDice/RankIoU, inspired by plug-in rules of the Bayes segmentation rule. Three numerical algorithms with GPU parallel execution are developed to implement the proposed framework in large-scale and high-dimensional segmentation. We study statistical properties of the proposed framework. We show it is Dice-/IoU-calibrated, and its excess risk bounds and the rate of convergence are also provided. The numerical effectiveness of RankDice/mRankDice is demonstrated in various simulated examples and Fine-annotated CityScapes, Pascal VOC and Kvasir-SEG datasets with state-of-the-art deep learning architectures.  ( 3 min )
    Faster Algorithms for Structured Linear and Kernel Support Vector Machines. (arXiv:2307.07735v2 [math.OC] UPDATED)
    Quadratic programming is a ubiquitous prototype in convex programming. Many combinatorial optimizations on graphs and machine learning problems can be formulated as quadratic programming; for example, Support Vector Machines (SVMs). Linear and kernel SVMs have been among the most popular models in machine learning over the past three decades, prior to the deep learning era. Generally, a quadratic program has an input size of $\Theta(n^2)$, where $n$ is the number of variables. Assuming the Strong Exponential Time Hypothesis ($\textsf{SETH}$), it is known that no $O(n^{2-o(1)})$ algorithm exists (Backurs, Indyk, and Schmidt, NIPS'17). However, problems such as SVMs usually feature much smaller input sizes: one is given $n$ data points, each of dimension $d$, with $d \ll n$. Furthermore, SVMs are variants with only $O(1)$ linear constraints. This suggests that faster algorithms are feasible, provided the program exhibits certain underlying structures. In this work, we design the first nearly-linear time algorithm for solving quadratic programs whenever the quadratic objective has small treewidth or admits a low-rank factorization, and the number of linear constraints is small. Consequently, we obtain a variety of results for SVMs: * For linear SVM, where the quadratic constraint matrix has treewidth $\tau$, we can solve the corresponding program in time $\widetilde O(n\tau^{(\omega+1)/2}\log(1/\epsilon))$; * For linear SVM, where the quadratic constraint matrix admits a low-rank factorization of rank-$k$, we can solve the corresponding program in time $\widetilde O(nk^{(\omega+1)/2}\log(1/\epsilon))$; * For Gaussian kernel SVM, where the data dimension $d = \Theta(\log n)$ and the squared dataset radius is small, we can solve it in time $O(n^{1+o(1)}\log(1/\epsilon))$. We also prove that when the squared dataset radius is large, then $\Omega(n^{2-o(1)})$ time is required.  ( 3 min )
    A Data-driven Deep Learning Approach for Bitcoin Price Forecasting. (arXiv:2311.06280v1 [q-fin.ST])
    Bitcoin as a cryptocurrency has been one of the most important digital coins and the first decentralized digital currency. Deep neural networks, on the other hand, has shown promising results recently; however, we require huge amount of high-quality data to leverage their power. There are some techniques such as augmentation that can help us with increasing the dataset size, but we cannot exploit them on historical bitcoin data. As a result, we propose a shallow Bidirectional-LSTM (Bi-LSTM) model, fed with feature engineered data using our proposed method to forecast bitcoin closing prices in a daily time frame. We compare the performance with that of other forecasting methods, and show that with the help of the proposed feature engineering method, a shallow deep neural network outperforms other popular price forecasting models.  ( 2 min )
    A Feature Selection Method for Driver Stress Detection Using Heart Rate Variability and Breathing Rate. (arXiv:2302.01602v2 [cs.LG] UPDATED)
    Driver stress is a major cause of car accidents and death worldwide. Furthermore, persistent stress is a health problem, contributing to hypertension and other diseases of the cardiovascular system. Stress has a measurable impact on heart and breathing rates and stress levels can be inferred from such measurements. Galvanic skin response is a common test to measure the perspiration caused by both physiological and psychological stress, as well as extreme emotions. In this paper, galvanic skin response is used to estimate the ground truth stress levels. A feature selection technique based on the minimal redundancy-maximal relevance method is then applied to multiple heart rate variability and breathing rate metrics to identify a novel and optimal combination for use in detecting stress. The support vector machine algorithm with a radial basis function kernel was used along with these features to reliably predict stress. The proposed method has achieved a high level of accuracy on the target dataset.  ( 2 min )
    Incremental Learning in Diagonal Linear Networks. (arXiv:2208.14673v2 [cs.LG] UPDATED)
    Diagonal linear networks (DLNs) are a toy simplification of artificial neural networks; they consist in a quadratic reparametrization of linear regression inducing a sparse implicit regularization. In this paper, we describe the trajectory of the gradient flow of DLNs in the limit of small initialization. We show that incremental learning is effectively performed in the limit: coordinates are successively activated, while the iterate is the minimizer of the loss constrained to have support on the active coordinates only. This shows that the sparse implicit regularization of DLNs decreases with time. This work is restricted to the underparametrized regime with anti-correlated features for technical reasons.  ( 2 min )
    Blockchain-Enabled Federated Learning Approach for Vehicular Networks. (arXiv:2311.06372v1 [cs.LG])
    Data from interconnected vehicles may contain sensitive information such as location, driving behavior, personal identifiers, etc. Without adequate safeguards, sharing this data jeopardizes data privacy and system security. The current centralized data-sharing paradigm in these systems raises particular concerns about data privacy. Recognizing these challenges, the shift towards decentralized interactions in technology, as echoed by the principles of Industry 5.0, becomes paramount. This work is closely aligned with these principles, emphasizing decentralized, human-centric, and secure technological interactions in an interconnected vehicular ecosystem. To embody this, we propose a practical approach that merges two emerging technologies: Federated Learning (FL) and Blockchain. The integration of these technologies enables the creation of a decentralized vehicular network. In this setting, vehicles can learn from each other without compromising privacy while also ensuring data integrity and accountability. Initial experiments show that compared to conventional decentralized federated learning techniques, our proposed approach significantly enhances the performance and security of vehicular networks. The system's accuracy stands at 91.92\%. While this may appear to be low in comparison to state-of-the-art federated learning models, our work is noteworthy because, unlike others, it was achieved in a malicious vehicle setting. Despite the challenging environment, our method maintains high accuracy, making it a competent solution for preserving data privacy in vehicular networks.  ( 2 min )
    Non-contrastive representation learning for intervals from well logs. (arXiv:2209.14750v3 [cs.AI] UPDATED)
    The representation learning problem in the oil & gas industry aims to construct a model that provides a representation based on logging data for a well interval. Previous attempts are mainly supervised and focus on similarity task, which estimates closeness between intervals. We desire to build informative representations without using supervised (labelled) data. One of the possible approaches is self-supervised learning (SSL). In contrast to the supervised paradigm, this one requires little or no labels for the data. Nowadays, most SSL approaches are either contrastive or non-contrastive. Contrastive methods make representations of similar (positive) objects closer and distancing different (negative) ones. Due to possible wrong marking of positive and negative pairs, these methods can provide an inferior performance. Non-contrastive methods don't rely on such labelling and are widespread in computer vision. They learn using only pairs of similar objects that are easier to identify in logging data. We are the first to introduce non-contrastive SSL for well-logging data. In particular, we exploit Bootstrap Your Own Latent (BYOL) and Barlow Twins methods that avoid using negative pairs and focus only on matching positive pairs. The crucial part of these methods is an augmentation strategy. Our augmentation strategies and adaption of BYOL and Barlow Twins together allow us to achieve superior quality on clusterization and mostly the best performance on different classification tasks. Our results prove the usefulness of the proposed non-contrastive self-supervised approaches for representation learning and interval similarity in particular.  ( 3 min )
    Algorithmic Robustness. (arXiv:2311.06275v1 [cs.AI])
    Algorithmic robustness refers to the sustained performance of a computational system in the face of change in the nature of the environment in which that system operates or in the task that the system is meant to perform. Below, we motivate the importance of algorithmic robustness, present a conceptual framework, and highlight the relevant areas of research for which algorithmic robustness is relevant. Why robustness? Robustness is an important enabler of other goals that are frequently cited in the context of public policy decisions about computational systems, including trustworthiness, accountability, fairness, and safety. Despite this dependence, it tends to be under-recognized compared to these other concepts. This is unfortunate, because robustness is often more immediately achievable than these other ultimate goals, which can be more subjective and exacting. Thus, we highlight robustness as an important goal for researchers, engineers, regulators, and policymakers when considering the design, implementation, and deployment of computational systems. We urge researchers and practitioners to elevate the attention paid to robustness when designing and evaluating computational systems. For many key systems, the immediate question after any demonstration of high performance should be: "How robust is that performance to realistic changes in the task or environment?" Greater robustness will set the stage for systems that are more trustworthy, accountable, fair, and safe. Toward that end, this document provides a brief roadmap to some of the concepts and existing research around the idea of algorithmic robustness.  ( 2 min )
    Faithful Path Language Modelling for Explainable Recommendation over Knowledge Graph. (arXiv:2310.16452v2 [cs.IR] UPDATED)
    Path reasoning methods over knowledge graphs have gained popularity for their potential to improve transparency in recommender systems. However, the resulting models still rely on pre-trained knowledge graph embeddings, fail to fully exploit the interdependence between entities and relations in the KG for recommendation, and may generate inaccurate explanations. In this paper, we introduce PEARLM, a novel approach that efficiently captures user behaviour and product-side knowledge through language modelling. With our approach, knowledge graph embeddings are directly learned from paths over the KG by the language model, which also unifies entities and relations in the same optimisation space. Constraints on the sequence decoding additionally guarantee path faithfulness with respect to the KG. Experiments on two datasets show the effectiveness of our approach compared to state-of-the-art baselines. Source code and datasets: AVAILABLE AFTER GETTING ACCEPTED.  ( 2 min )
    Can Machine Learning Uncover Insights into Vehicle Travel Demand from Our Built Environment?. (arXiv:2311.06321v1 [cs.LG])
    In this paper, we propose a machine learning-based approach to address the lack of ability for designers to optimize urban land use planning from the perspective of vehicle travel demand. Research shows that our computational model can help designers quickly obtain feedback on the vehicle travel demand, which includes its total amount and temporal distribution based on the urban function distribution designed by the designers. It also assists in design optimization and evaluation of the urban function distribution from the perspective of vehicle travel. We obtain the city function distribution information and vehicle hours traveled (VHT) information by collecting the city point-of-interest (POI) data and online vehicle data. The artificial neural networks (ANNs) with the best performance in prediction are selected. By using data sets collected in different regions for mutual prediction and remapping the predictions onto a map for visualization, we evaluate the extent to which the computational model sees use across regions in an attempt to reduce the workload of future urban researchers. Finally, we demonstrate the application of the computational model to help designers obtain feedback on vehicle travel demand in the built environment and combine it with genetic algorithms to optimize the current state of the urban environment to provide recommendations to designers.
    Automated Design of Metaheuristic Algorithms: A Survey. (arXiv:2303.06532v2 [cs.NE] UPDATED)
    Metaheuristics have gained great success in academia and practice because their search logic can be applied to any problem with available solution representation, solution quality evaluation, and certain notions of locality. Manually designing metaheuristic algorithms for solving a target problem is criticized for being laborious, error-prone, and requiring intensive specialized knowledge. This gives rise to increasing interest in automated design of metaheuristic algorithms. With computing power to fully explore potential design choices, the automated design could reach and even surpass human-level design and could make high-performance algorithms accessible to a much wider range of researchers and practitioners. This paper presents a broad picture of automated design of metaheuristic algorithms, by conducting a survey on the common grounds and representative techniques in terms of design space, design strategies, performance evaluation strategies, and target problems in this field.
    Optimal and Private Learning from Human Response Data. (arXiv:2303.06234v2 [cs.LG] UPDATED)
    Item response theory (IRT) is the study of how people make probabilistic decisions, with diverse applications in education testing, recommendation systems, among others. The Rasch model of binary response data, one of the most fundamental models in IRT, remains an active area of research with important practical significance. Recently, Nguyen and Zhang (2022) proposed a new spectral estimation algorithm that is efficient and accurate. In this work, we extend their results in two important ways. Firstly, we obtain a refined entrywise error bound for the spectral algorithm, complementing the `average error' $\ell_2$ bound in their work. Notably, under mild sampling conditions, the spectral algorithm achieves the minimax optimal error bound (modulo a log factor). Building on the refined analysis, we also show that the spectral algorithm enjoys optimal sample complexity for top-$K$ recovery (e.g., identifying the best $K$ items from approval/disapproval response data), explaining the empirical findings in the previous work. Our second contribution addresses an important but understudied topic in IRT: privacy. Despite the human-centric applications of IRT, there has not been any proposed privacy-preserving mechanism in the literature. We develop a private extension of the spectral algorithm, leveraging its unique Markov chain formulation and the discrete Gaussian mechanism (Canonne et al., 2020). Experiments show that our approach is significantly more accurate than the baselines in the low-to-moderate privacy regime.
    Modulated Neural ODEs. (arXiv:2302.13262v3 [cs.LG] UPDATED)
    Neural ordinary differential equations (NODEs) have been proven useful for learning non-linear dynamics of arbitrary trajectories. However, current NODE methods capture variations across trajectories only via the initial state value or by auto-regressive encoder updates. In this work, we introduce Modulated Neural ODEs (MoNODEs), a novel framework that sets apart dynamics states from underlying static factors of variation and improves the existing NODE methods. In particular, we introduce $\textit{time-invariant modulator variables}$ that are learned from the data. We incorporate our proposed framework into four existing NODE variants. We test MoNODE on oscillating systems, videos and human walking trajectories, where each trajectory has trajectory-specific modulation. Our framework consistently improves the existing model ability to generalize to new dynamic parameterizations and to perform far-horizon forecasting. In addition, we verify that the proposed modulator variables are informative of the true unknown factors of variation as measured by $R^2$ scores.
    TURBO: The Swiss Knife of Auto-Encoders. (arXiv:2311.06527v1 [cs.LG])
    We present a novel information-theoretic framework, termed as TURBO, designed to systematically analyse and generalise auto-encoding methods. We start by examining the principles of information bottleneck and bottleneck-based networks in the auto-encoding setting and identifying their inherent limitations, which become more prominent for data with multiple relevant, physics-related representations. The TURBO framework is then introduced, providing a comprehensive derivation of its core concept consisting of the maximisation of mutual information between various data representations expressed in two directions reflecting the information flows. We illustrate that numerous prevalent neural network models are encompassed within this framework. The paper underscores the insufficiency of the information bottleneck concept in elucidating all such models, thereby establishing TURBO as a preferable theoretical reference. The introduction of TURBO contributes to a richer understanding of data representation and the structure of neural network models, enabling more efficient and versatile applications.
    Human-Aligned Calibration for AI-Assisted Decision Making. (arXiv:2306.00074v3 [cs.LG] UPDATED)
    Whenever a binary classifier is used to provide decision support, it typically provides both a label prediction and a confidence value. Then, the decision maker is supposed to use the confidence value to calibrate how much to trust the prediction. In this context, it has been often argued that the confidence value should correspond to a well calibrated estimate of the probability that the predicted label matches the ground truth label. However, multiple lines of empirical evidence suggest that decision makers have difficulties at developing a good sense on when to trust a prediction using these confidence values. In this paper, our goal is first to understand why and then investigate how to construct more useful confidence values. We first argue that, for a broad class of utility functions, there exist data distributions for which a rational decision maker is, in general, unlikely to discover the optimal decision policy using the above confidence values -- an optimal decision maker would need to sometimes place more (less) trust on predictions with lower (higher) confidence values. However, we then show that, if the confidence values satisfy a natural alignment property with respect to the decision maker's confidence on her own predictions, there always exists an optimal decision policy under which the level of trust the decision maker would need to place on predictions is monotone on the confidence values, facilitating its discoverability. Further, we show that multicalibration with respect to the decision maker's confidence on her own predictions is a sufficient condition for alignment. Experiments on four different AI-assisted decision making tasks where a classifier provides decision support to real human experts validate our theoretical results and suggest that alignment may lead to better decisions.
    Boosting Stock Price Prediction with Anticipated Macro Policy Changes. (arXiv:2311.06278v1 [q-fin.ST])
    Prediction of stock prices plays a significant role in aiding the decision-making of investors. Considering its importance, a growing literature has emerged trying to forecast stock prices with improved accuracy. In this study, we introduce an innovative approach for forecasting stock prices with greater accuracy. We incorporate external economic environment-related information along with stock prices. In our novel approach, we improve the performance of stock price prediction by taking into account variations due to future expected macroeconomic policy changes as investors adjust their current behavior ahead of time based on expected future macroeconomic policy changes. Furthermore, we incorporate macroeconomic variables along with historical stock prices to make predictions. Results from this strongly support the inclusion of future economic policy changes along with current macroeconomic information. We confirm the supremacy of our method over the conventional approach using several tree-based machine-learning algorithms. Results are strongly conclusive across various machine learning models. Our preferred model outperforms the conventional approach with an RMSE value of 1.61 compared to an RMSE value of 1.75 from the conventional approach.
    Mitigating Pooling Bias in E-commerce Search via False Negative Estimation. (arXiv:2311.06444v1 [cs.IR])
    Efficient and accurate product relevance assessment is critical for user experiences and business success. Training a proficient relevance assessment model requires high-quality query-product pairs, often obtained through negative sampling strategies. Unfortunately, current methods introduce pooling bias by mistakenly sampling false negatives, diminishing performance and business impact. To address this, we present Bias-mitigating Hard Negative Sampling (BHNS), a novel negative sampling strategy tailored to identify and adjust for false negatives, building upon our original False Negative Estimation algorithm. Our experiments in the Instacart search setting confirm BHNS as effective for practical e-commerce use. Furthermore, comparative analyses on public dataset showcase its domain-agnostic potential for diverse applications.
    SCADI: Self-supervised Causal Disentanglement in Latent Variable Models. (arXiv:2311.06567v1 [cs.LG])
    Causal disentanglement has great potential for capturing complex situations. However, there is a lack of practical and efficient approaches. It is already known that most unsupervised disentangling methods are unable to produce identifiable results without additional information, often leading to randomly disentangled output. Therefore, most existing models for disentangling are weakly supervised, providing information about intrinsic factors, which incurs excessive costs. Therefore, we propose a novel model, SCADI(SElf-supervised CAusal DIsentanglement), that enables the model to discover semantic factors and learn their causal relationships without any supervision. This model combines a masked structural causal model (SCM) with a pseudo-label generator for causal disentanglement, aiming to provide a new direction for self-supervised causal disentanglement models.
    Graph ODE with Factorized Prototypes for Modeling Complicated Interacting Dynamics. (arXiv:2311.06554v1 [cs.LG])
    This paper studies the problem of modeling interacting dynamical systems, which is critical for understanding physical dynamics and biological processes. Recent research predominantly uses geometric graphs to represent these interactions, which are then captured by powerful graph neural networks (GNNs). However, predicting interacting dynamics in challenging scenarios such as out-of-distribution shift and complicated underlying rules remains unsolved. In this paper, we propose a new approach named Graph ODE with factorized prototypes (GOAT) to address the problem. The core of GOAT is to incorporate factorized prototypes from contextual knowledge into a continuous graph ODE framework. Specifically, GOAT employs representation disentanglement and system parameters to extract both object-level and system-level contexts from historical trajectories, which allows us to explicitly model their independent influence and thus enhances the generalization capability under system changes. Then, we integrate these disentangled latent representations into a graph ODE model, which determines a combination of various interacting prototypes for enhanced model expressivity. The entire model is optimized using an end-to-end variational inference framework to maximize the likelihood. Extensive experiments in both in-distribution and out-of-distribution settings validate the superiority of GOAT.
    Deep Learning in Cardiology. (arXiv:1902.11122v4 [cs.CV] UPDATED)
    The medical field is creating large amount of data that physicians are unable to decipher and use efficiently. Moreover, rule-based expert systems are inefficient in solving complicated medical tasks or for creating insights using big data. Deep learning has emerged as a more accurate and effective technology in a wide range of medical problems such as diagnosis, prediction and intervention. Deep learning is a representation learning method that consists of layers that transform the data non-linearly, thus, revealing hierarchical relationships and structures. In this review we survey deep learning application papers that use structured data, signal and imaging modalities from cardiology. We discuss the advantages and limitations of applying deep learning in cardiology that also apply in medicine in general, while proposing certain directions as the most viable for clinical use.  ( 2 min )
    From Charts to Atlas: Merging Latent Spaces into One. (arXiv:2311.06547v1 [cs.LG])
    Models trained on semantically related datasets and tasks exhibit comparable inter-sample relations within their latent spaces. We investigate in this study the aggregation of such latent spaces to create a unified space encompassing the combined information. To this end, we introduce Relative Latent Space Aggregation, a two-step approach that first renders the spaces comparable using relative representations, and then aggregates them via a simple mean. We carefully divide a classification problem into a series of learning tasks under three different settings: sharing samples, classes, or neither. We then train a model on each task and aggregate the resulting latent spaces. We compare the aggregated space with that derived from an end-to-end model trained over all tasks and show that the two spaces are similar. We then observe that the aggregated space is better suited for classification, and empirically demonstrate that it is due to the unique imprints left by task-specific embedders within the representations. We finally test our framework in scenarios where no shared region exists and show that it can still be used to merge the spaces, albeit with diminished benefits over naive merging.
    Knowledge Graphs are not Created Equal: Exploring the Properties and Structure of Real KGs. (arXiv:2311.06414v1 [cs.LG])
    Despite the recent popularity of knowledge graph (KG) related tasks and benchmarks such as KG embeddings, link prediction, entity alignment and evaluation of the reasoning abilities of pretrained language models as KGs, the structure and properties of real KGs are not well studied. In this paper, we perform a large scale comparative study of 29 real KG datasets from diverse domains such as the natural sciences, medicine, and NLP to analyze their properties and structural patterns. Based on our findings, we make several recommendations regarding KG-based model development and evaluation. We believe that the rich structural information contained in KGs can benefit the development of better KG models across fields and we hope this study will contribute to breaking the existing data silos between different areas of research (e.g., ML, NLP, AI for sciences).
    Towards a data-driven debt collection strategy based on an advanced machine learning framework. (arXiv:2311.06292v1 [q-fin.ST])
    The European debt purchase market as measured by the total book value of purchased debt approached 25bn euros in 2020 and it was growing at double-digit rates. This is an example of how big the debt collection and debt purchase industry has grown and the important impact it has in the financial sector. However, in order to ensure an adequate return during the debt collection process, a good estimation of the propensity to pay and/or the expected cashflow is crucial. These estimations can be employed, for instance, to create different strategies during the amicable collection to maximize quality standards and revenues. And not only that, but also to prioritize the cases in which a legal process is necessary when debtors are unreachable for an amicable negotiation. This work offers a solution for these estimations. Specifically, a new machine learning modelling pipeline is presented showing how outperforms current strategies employed in the sector. The solution contains a pre-processing pipeline and a model selector based on the best model calibration. Performance is validated with real historical data of the debt industry.  ( 2 min )
    A Trichotomy for Transductive Online Learning. (arXiv:2311.06428v1 [cs.LG])
    We present new upper and lower bounds on the number of learner mistakes in the `transductive' online learning setting of Ben-David, Kushilevitz and Mansour (1997). This setting is similar to standard online learning, except that the adversary fixes a sequence of instances $x_1,\dots,x_n$ to be labeled at the start of the game, and this sequence is known to the learner. Qualitatively, we prove a trichotomy, stating that the minimal number of mistakes made by the learner as $n$ grows can take only one of precisely three possible values: $n$, $\Theta\left(\log (n)\right)$, or $\Theta(1)$. Furthermore, this behavior is determined by a combination of the VC dimension and the Littlestone dimension. Quantitatively, we show a variety of bounds relating the number of mistakes to well-known combinatorial dimensions. In particular, we improve the known lower bound on the constant in the $\Theta(1)$ case from $\Omega\left(\sqrt{\log(d)}\right)$ to $\Omega(\log(d))$ where $d$ is the Littlestone dimension. Finally, we extend our results to cover multiclass classification and the agnostic setting.
    Convolve and Conquer: Data Comparison with Wiener Filters. (arXiv:2311.06558v1 [cs.LG])
    Quantitative evaluations of differences and/or similarities between data samples define and shape optimisation problems associated with learning data distributions. Current methods to compare data often suffer from limitations in capturing such distributions or lack desirable mathematical properties for optimisation (e.g. smoothness, differentiability, or convexity). In this paper, we introduce a new method to measure (dis)similarities between paired samples inspired by Wiener-filter theory. The convolutional nature of Wiener filters allows us to comprehensively compare data samples in a globally correlated way. We validate our approach in four machine learning applications: data compression, medical imaging imputation, translated classification, and non-parametric generative modelling. Our results demonstrate increased resolution in reconstructed images with better perceptual quality and higher data fidelity, as well as robustness against translations, compared to conventional mean-squared-error analogue implementations.  ( 2 min )
    Topology-Matching Normalizing Flows for Out-of-Distribution Detection in Robot Learning. (arXiv:2311.06481v1 [cs.RO])
    To facilitate reliable deployments of autonomous robots in the real world, Out-of-Distribution (OOD) detection capabilities are often required. A powerful approach for OOD detection is based on density estimation with Normalizing Flows (NFs). However, we find that prior work with NFs attempts to match the complex target distribution topologically with naive base distributions leading to adverse implications. In this work, we circumvent this topological mismatch using an expressive class-conditional base distribution trained with an information-theoretic objective to match the required topology. The proposed method enjoys the merits of wide compatibility with existing learned models without any performance degradation and minimum computation overhead while enhancing OOD detection capabilities. We demonstrate superior results in density estimation and 2D object detection benchmarks in comparison with extensive baselines. Moreover, we showcase the applicability of the method with a real-robot deployment.  ( 2 min )
    A statistical perspective on algorithm unrolling models for inverse problems. (arXiv:2311.06395v1 [stat.ML])
    We consider inverse problems where the conditional distribution of the observation ${\bf y}$ given the latent variable of interest ${\bf x}$ (also known as the forward model) is known, and we have access to a data set in which multiple instances of ${\bf x}$ and ${\bf y}$ are both observed. In this context, algorithm unrolling has become a very popular approach for designing state-of-the-art deep neural network architectures that effectively exploit the forward model. We analyze the statistical complexity of the gradient descent network (GDN), an algorithm unrolling architecture driven by proximal gradient descent. We show that the unrolling depth needed for the optimal statistical performance of GDNs is of order $\log(n)/\log(\varrho_n^{-1})$, where $n$ is the sample size, and $\varrho_n$ is the convergence rate of the corresponding gradient descent algorithm. We also show that when the negative log-density of the latent variable ${\bf x}$ has a simple proximal operator, then a GDN unrolled at depth $D'$ can solve the inverse problem at the parametric rate $O(D'/\sqrt{n})$. Our results thus also suggest that algorithm unrolling models are prone to overfitting as the unrolling depth $D'$ increases. We provide several examples to illustrate these results.  ( 2 min )
    Distilling Large Language Models using Skill-Occupation Graph Context for HR-Related Tasks. (arXiv:2311.06383v1 [cs.CL])
    Numerous HR applications are centered around resumes and job descriptions. While they can benefit from advancements in NLP, particularly large language models, their real-world adoption faces challenges due to absence of comprehensive benchmarks for various HR tasks, and lack of smaller models with competitive capabilities. In this paper, we aim to bridge this gap by introducing the Resume-Job Description Benchmark (RJDB). We meticulously craft this benchmark to cater to a wide array of HR tasks, including matching and explaining resumes to job descriptions, extracting skills and experiences from resumes, and editing resumes. To create this benchmark, we propose to distill domain-specific knowledge from a large language model (LLM). We rely on a curated skill-occupation graph to ensure diversity and provide context for LLMs generation. Our benchmark includes over 50 thousand triples of job descriptions, matched resumes and unmatched resumes. Using RJDB, we train multiple smaller student models. Our experiments reveal that the student models achieve near/better performance than the teacher model (GPT-4), affirming the effectiveness of the benchmark. Additionally, we explore the utility of RJDB on out-of-distribution data for skill extraction and resume-job description matching, in zero-shot and weak supervision manner. We release our datasets and code to foster further research and industry applications.  ( 2 min )
    BClean: A Bayesian Data Cleaning System. (arXiv:2311.06517v1 [cs.AI])
    There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of prevalent approaches is probabilistic methods, including Bayesian methods. However, existing probabilistic methods often assume a simplistic distribution (e.g., Gaussian distribution), which is frequently underfitted in practice, or they necessitate experts to provide a complex prior distribution (e.g., via a programming language). This requirement is both labor-intensive and costly, rendering these methods less suitable for real-world applications. In this paper, we propose BClean, a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction. We recast the data cleaning problem as a Bayesian inference that fully exploits the relationships between attributes in the observed dataset and any prior information provided by users. To this end, we present an automatic Bayesian network construction method that extends a structure learning-based functional dependency discovery method with similarity functions to capture the relationships between attributes. Furthermore, our system allows users to modify the generated Bayesian network in order to specify prior information or correct inaccuracies identified by the automatic generation process. We also design an effective scoring model (called the compensative scoring model) necessary for the Bayesian inference. To enhance the efficiency of data cleaning, we propose several approximation strategies for the Bayesian inference, including graph partitioning, domain pruning, and pre-detection. By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning, outperforming existing Bayesian methods by 2% and other data cleaning methods by 15%.  ( 3 min )
    Gradual Optimization Learning for Conformational Energy Minimization. (arXiv:2311.06295v1 [physics.chem-ph])
    Molecular conformation optimization is crucial to computer-aided drug discovery and materials design. Traditional energy minimization techniques rely on iterative optimization methods that use molecular forces calculated by a physical simulator (oracle) as anti-gradients. However, this is a computationally expensive approach that requires many interactions with a physical simulator. One way to accelerate this procedure is to replace the physical simulator with a neural network. Despite recent progress in neural networks for molecular conformation energy prediction, such models are prone to distribution shift, leading to inaccurate energy minimization. We find that the quality of energy minimization with neural networks can be improved by providing optimization trajectories as additional training data. Still, it takes around $5 \times 10^5$ additional conformations to match the physical simulator's optimization quality. In this work, we present the Gradual Optimization Learning Framework (GOLF) for energy minimization with neural networks that significantly reduces the required additional data. The framework consists of an efficient data-collecting scheme and an external optimizer. The external optimizer utilizes gradients from the energy prediction model to generate optimization trajectories, and the data-collecting scheme selects additional training data to be processed by the physical simulator. Our results demonstrate that the neural network trained with GOLF performs on par with the oracle on a benchmark of diverse drug-like molecules using $50$x less additional data.  ( 2 min )
    Towards A Unified Neural Architecture for Visual Recognition and Reasoning. (arXiv:2311.06386v1 [cs.CV])
    Recognition and reasoning are two pillars of visual understanding. However, these tasks have an imbalance in focus; whereas recent advances in neural networks have shown strong empirical performance in visual recognition, there has been comparably much less success in solving visual reasoning. Intuitively, unifying these two tasks under a singular framework is desirable, as they are mutually dependent and beneficial. Motivated by the recent success of multi-task transformers for visual recognition and language understanding, we propose a unified neural architecture for visual recognition and reasoning with a generic interface (e.g., tokens) for both. Our framework enables the principled investigation of how different visual recognition tasks, datasets, and inductive biases can help enable spatiotemporal reasoning capabilities. Noticeably, we find that object detection, which requires spatial localization of individual objects, is the most beneficial recognition task for reasoning. We further demonstrate via probing that implicit object-centric representations emerge automatically inside our framework. Intriguingly, we discover that certain architectural choices such as the backbone model of the visual encoder have a significant impact on visual reasoning, but little on object detection. Given the results of our experiments, we believe that visual reasoning should be considered as a first-class citizen alongside visual recognition, as they are strongly correlated but benefit from potentially different design choices.  ( 2 min )
    Privacy-Engineered Value Decomposition Networks for Cooperative Multi-Agent Reinforcement Learning. (arXiv:2311.06255v1 [cs.MA])
    In cooperative multi-agent reinforcement learning (Co-MARL), a team of agents must jointly optimize the team's long-term rewards to learn a designated task. Optimizing rewards as a team often requires inter-agent communication and data sharing, leading to potential privacy implications. We assume privacy considerations prohibit the agents from sharing their environment interaction data. Accordingly, we propose Privacy-Engineered Value Decomposition Networks (PE-VDN), a Co-MARL algorithm that models multi-agent coordination while provably safeguarding the confidentiality of the agents' environment interaction data. We integrate three privacy-engineering techniques to redesign the data flows of the VDN algorithm, an existing Co-MARL algorithm that consolidates the agents' environment interaction data to train a central controller that models multi-agent coordination, and develop PE-VDN. In the first technique, we design a distributed computation scheme that eliminates Vanilla VDN's dependency on sharing environment interaction data. Then, we utilize a privacy-preserving multi-party computation protocol to guarantee that the data flows of the distributed computation scheme do not pose new privacy risks. Finally, we enforce differential privacy to preempt inference threats against the agents' training data, past environment interactions, when they take actions based on their neural network predictions. We implement PE-VDN in StarCraft Multi-Agent Competition (SMAC) and show that it achieves 80% of Vanilla VDN's win rate while maintaining differential privacy levels that provide meaningful privacy guarantees. The results demonstrate that PE-VDN can safeguard the confidentiality of agents' environment interaction data without sacrificing multi-agent coordination.  ( 2 min )
    Reviewing Developments of Graph Convolutional Network Techniques for Recommendation Systems. (arXiv:2311.06323v1 [cs.IR])
    The Recommender system is a vital information service on today's Internet. Recently, graph neural networks have emerged as the leading approach for recommender systems. We try to review recent literature on graph neural network-based recommender systems, covering the background and development of both recommender systems and graph neural networks. Then categorizing recommender systems by their settings and graph neural networks by spectral and spatial models, we explore the motivation behind incorporating graph neural networks into recommender systems. We also analyze challenges and open problems in graph construction, embedding propagation and aggregation, and computation efficiency. This guides us to better explore the future directions and developments in this domain.  ( 2 min )
    STRIDE: Structure-guided Generation for Inverse Design of Molecules. (arXiv:2311.06297v1 [physics.chem-ph])
    Machine learning and especially deep learning has had an increasing impact on molecule and materials design. In particular, given the growing access to an abundance of high-quality small molecule data for generative modeling for drug design, results for drug discovery have been promising. However, for many important classes of materials such as catalysts, antioxidants, and metal-organic frameworks, such large datasets are not available. Such families of molecules with limited samples and structural similarities are especially prevalent for industrial applications. As is well-known, retraining and even fine-tuning are challenging on such small datasets. Novel, practically applicable molecules are most often derivatives of well-known molecules, suggesting approaches to addressing data scarcity. To address this problem, we introduce $\textbf{STRIDE}$, a generative molecule workflow that generates novel molecules with an unconditional generative model guided by known molecules without any retraining. We generate molecules outside of the training data from a highly specialized set of antioxidant molecules. Our generated molecules have on average 21.7% lower synthetic accessibility scores and also reduce ionization potential by 5.9% of generated molecules via guiding.  ( 2 min )
    Image Classification using Combination of Topological Features and Neural Networks. (arXiv:2311.06375v1 [cs.CV])
    In this work we use the persistent homology method, a technique in topological data analysis (TDA), to extract essential topological features from the data space and combine them with deep learning features for classification tasks. In TDA, the concepts of complexes and filtration are building blocks. Firstly, a filtration is constructed from some complex. Then, persistent homology classes are computed, and their evolution along the filtration is visualized through the persistence diagram. Additionally, we applied vectorization techniques to the persistence diagram to make this topological information compatible with machine learning algorithms. This was carried out with the aim of classifying images from multiple classes in the MNIST dataset. Our approach inserts topological features into deep learning approaches composed by single and two-streams neural networks architectures based on a multi-layer perceptron (MLP) and a convolutional neral network (CNN) taylored for multi-class classification in the MNIST dataset. In our analysis, we evaluated the obtained results and compared them with the outcomes achieved through the baselines that are available in the TensorFlow library. The main conclusion is that topological information may increase neural network accuracy in multi-class classification tasks with the price of computational complexity of persistent homology calculation. Up to the best of our knowledge, it is the first work that combines deep learning features and the combination of topological features for multi-class classification tasks.  ( 3 min )
    Retro-BLEU: Quantifying Chemical Plausibility of Retrosynthesis Routes through Reaction Template Sequence Analysis. (arXiv:2311.06304v1 [cs.LG])
    Computer-assisted methods have emerged as valuable tools for retrosynthesis analysis. However, quantifying the plausibility of generated retrosynthesis routes remains a challenging task. We introduce Retro-BLEU, a statistical metric adapted from the well-established BLEU score in machine translation, to evaluate the plausibility of retrosynthesis routes based on reaction template sequences analysis. We demonstrate the effectiveness of Retro-BLEU by applying it to a diverse set of retrosynthesis routes generated by state-of-the-art algorithms and compare the performance with other evaluation metrics. The results show that Retro-BLEU is capable of differentiating between plausible and implausible routes. Furthermore, we provide insights into the strengths and weaknesses of Retro-BLEU, paving the way for future developments and improvements in this field.  ( 2 min )
    ShipGen: A Diffusion Model for Parametric Ship Hull Generation with Multiple Objectives and Constraints. (arXiv:2311.06315v1 [cs.LG])
    Ship design is a years-long process that requires balancing complex design trade-offs to create a ship that is efficient and effective. Finding new ways to improve the ship design process can lead to significant cost savings for ship building and operation. One promising technology is generative artificial intelligence, which has been shown to reduce design cycle time and create novel, high-performing designs. In literature review, generative artificial intelligence has been shown to generate ship hulls; however, ship design is particularly difficult as the hull of a ship requires the consideration of many objectives. This paper presents a study on the generation of parametric ship hull designs using a parametric diffusion model that considers multiple objectives and constraints for the hulls. This denoising diffusion probabilistic model (DDPM) generates the tabular parametric design vectors of a ship hull for evaluation. In addition to a tabular DDPM, this paper details adding guidance to improve the quality of generated ship hull designs. By leveraging classifier guidance, the DDPM produced feasible parametric ship hulls that maintain the coverage of the initial training dataset of ship hulls with a 99.5% rate, a 149x improvement over random sampling of the design vector parameters across the design space. Parametric ship hulls produced with performance guidance saw an average of 91.4% reduction in wave drag coefficients and an average of a 47.9x relative increase in the total displaced volume of the hulls compared to the mean performance of the hulls in the training dataset. The use of a DDPM to generate parametric ship hulls can reduce design time by generating high-performing hull designs for future analysis. These generated hulls have low drag and high volume, which can reduce the cost of operating a ship and increase its potential to generate revenue.  ( 3 min )
    Parallelization of an Ubiquitous Sequential Computation. (arXiv:2311.06281v1 [cs.DS])
    We show how to compute the elements of a sequence $x_t = a_t x_{t-1} + b_t$ in parallel, given $t = (1, 2, \dots, n)$, $a_t \in \mathbb{R}^n$, $b_t \in \mathbb{R}^n$, and initial value $x_0 \in \mathbb{R}$. On $n$ parallel processors, the computation of $n$ elements incurs $\mathcal{O}(\log n)$ time and $\mathcal{O}(n)$ space. Sequences of this form are ubiquitous in science and engineering, making their parallelization useful for a vast number of applications. We implement parallelization in software, test it on parallel hardware, and verify that it executes faster than sequential computation by a factor of $\frac{n}{\log n}$.  ( 2 min )
    Transfer Learning for Structured Pruning under Limited Task Data. (arXiv:2311.06382v1 [cs.CL])
    Large, pre-trained models are problematic to use in resource constrained applications. Fortunately, task-aware structured pruning methods offer a solution. These approaches reduce model size by dropping structural units like layers and attention heads in a manner that takes into account the end-task. However, these pruning algorithms require more task-specific data than is typically available. We propose a framework which combines structured pruning with transfer learning to reduce the need for task-specific data. Our empirical results answer questions such as: How should the two tasks be coupled? What parameters should be transferred? And, when during training should transfer learning be introduced? Leveraging these insights, we demonstrate that our framework results in pruned models with improved generalization over strong baselines.  ( 2 min )
    ODTlearn: A Package for Learning Optimal Decision Trees for Prediction and Prescription. (arXiv:2307.15691v2 [stat.ML] UPDATED)
    ODTLearn is an open-source Python package that provides methods for learning optimal decision trees for high-stakes predictive and prescriptive tasks based on the mixed-integer optimization (MIO) framework proposed in Aghaei et al. (2019) and several of its extensions. The current version of the package provides implementations for learning optimal classification trees, optimal fair classification trees, optimal classification trees robust to distribution shifts, and optimal prescriptive trees from observational data. We have designed the package to be easy to maintain and extend as new optimal decision tree problem classes, reformulation strategies, and solution algorithms are introduced. To this end, the package follows object-oriented design principles and supports both commercial (Gurobi) and open source (COIN-OR branch and cut) solvers. The package documentation and an extensive user guide can be found at https://d3m-research-group.github.io/odtlearn/. Additionally, users can view the package source code and submit feature requests and bug reports by visiting https://github.com/D3M-Research-Group/odtlearn.  ( 2 min )
    Mixture Weight Estimation and Model Prediction in Multi-source Multi-target Domain Adaptation. (arXiv:2309.10736v2 [cs.LG] UPDATED)
    We consider the problem of learning a model from multiple heterogeneous sources with the goal of performing well on a new target distribution. The goal of learner is to mix these data sources in a target-distribution aware way and simultaneously minimize the empirical risk on the mixed source. The literature has made some tangible advancements in establishing theory of learning on mixture domain. However, there are still two unsolved problems. Firstly, how to estimate the optimal mixture of sources, given a target domain; Secondly, when there are numerous target domains, how to solve empirical risk minimization (ERM) for each target using possibly unique mixture of data sources in a computationally efficient manner. In this paper we address both problems efficiently and with guarantees. We cast the first problem, mixture weight estimation, as a convex-nonconcave compositional minimax problem, and propose an efficient stochastic algorithm with provable stationarity guarantees. Next, for the second problem, we identify that for certain regimes, solving ERM for each target domain individually can be avoided, and instead parameters for a target optimal model can be viewed as a non-linear function on a space of the mixture coefficients. Building upon this, we show that in the offline setting, a GD-trained overparameterized neural network can provably learn such function to predict the model of target domain instead of solving a designated ERM problem. Finally, we also consider an online setting and propose a label efficient online algorithm, which predicts parameters for new targets given an arbitrary sequence of mixing coefficients, while enjoying regret guarantees.  ( 3 min )
    Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs. (arXiv:2310.11689v2 [cs.CL] UPDATED)
    Large language models (LLMs) have recently shown great advances in a variety of tasks, including natural language understanding and generation. However, their use in high-stakes decision-making scenarios is still limited due to the potential for errors. Selective prediction is a technique that can be used to improve the reliability of the LLMs by allowing them to abstain from making predictions when they are unsure of the answer. In this work, we propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of LLMs. Our framework is based on the idea of using parameter-efficient tuning to adapt the LLM to the specific task at hand while improving its ability to perform self-evaluation. We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods. For example, on the CoQA benchmark, our method improves the AUACC from 91.23% to 92.63% and improves the AUROC from 74.61% to 80.25%.  ( 2 min )
    Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant. (arXiv:2306.09338v3 [cs.LG] UPDATED)
    This article provides a comprehensive understanding of optimization in deep learning, with a primary focus on the challenges of gradient vanishing and gradient exploding, which normally lead to diminished model representational ability and training instability, respectively. We analyze these two challenges through several strategic measures, including the improvement of gradient flow and the imposition of constraints on a network's Lipschitz constant. To help understand the current optimization methodologies, we categorize them into two classes: explicit optimization and implicit optimization. Explicit optimization methods involve direct manipulation of optimizer parameters, including weight, gradient, learning rate, and weight decay. Implicit optimization methods, by contrast, focus on improving the overall landscape of a network by enhancing its modules, such as residual shortcuts, normalization methods, attention mechanisms, and activations. In this article, we provide an in-depth analysis of these two optimization classes and undertake a thorough examination of the Jacobian matrices and the Lipschitz constants of many widely used deep learning modules, highlighting existing issues as well as potential improvements. Moreover, we also conduct a series of analytical experiments to substantiate our theoretical discussions. This article does not aim to propose a new optimizer or network. Rather, our intention is to present a comprehensive understanding of optimization in deep learning. We hope that this article will assist readers in gaining a deeper insight in this field and encourages the development of more robust, efficient, and high-performing models.  ( 3 min )
    Steady-State Analysis and Online Learning for Queues with Hawkes Arrivals. (arXiv:2311.02577v2 [math.PR] UPDATED)
    We investigate the long-run behavior of single-server queues with Hawkes arrivals and general service distributions and related optimization problems. In detail, utilizing novel coupling techniques, we establish finite moment bounds for the stationary distribution of the workload and busy period processes. In addition, we are able to show that, those queueing processes converge exponentially fast to their stationary distribution. Based on these theoretic results, we develop an efficient numerical algorithm to solve the optimal staffing problem for the Hawkes queues in a data-driven manner. Numerical results indicate a sharp difference in staffing for Hawkes queues, compared to the classic GI/GI/1 model, especially in the heavy-traffic regime.  ( 2 min )
    Stochastic First-Order Learning for Large-Scale Flexibly Tied Gaussian Mixture Model. (arXiv:2212.05402v3 [cs.LG] UPDATED)
    Gaussian Mixture Models (GMMs) are one of the most potent parametric density models used extensively in many applications. Flexibly-tied factorization of the covariance matrices in GMMs is a powerful approach for coping with the challenges of common GMMs when faced with high-dimensional data and complex densities which often demand a large number of Gaussian components. However, the expectation-maximization algorithm for fitting flexibly-tied GMMs still encounters difficulties with streaming and very large dimensional data. To overcome these challenges, this paper suggests the use of first-order stochastic optimization algorithms. Specifically, we propose a new stochastic optimization algorithm on the manifold of orthogonal matrices. Through numerous empirical results on both synthetic and real datasets, we observe that stochastic optimization methods can outperform the expectation-maximization algorithm in terms of attaining better likelihood, needing fewer epochs for convergence, and consuming less time per each epoch.  ( 2 min )
    Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity. (arXiv:2307.08286v2 [cs.LG] UPDATED)
    Recent work has revealed many intriguing empirical phenomena in neural network training, despite the poorly understood and highly complex loss landscapes and training dynamics. One of these phenomena, Linear Mode Connectivity (LMC), has gained considerable attention due to the intriguing observation that different solutions can be connected by a linear path in the parameter space while maintaining near-constant training and test losses. In this work, we introduce a stronger notion of linear connectivity, Layerwise Linear Feature Connectivity (LLFC), which says that the feature maps of every layer in different trained networks are also linearly connected. We provide comprehensive empirical evidence for LLFC across a wide range of settings, demonstrating that whenever two trained networks satisfy LMC (via either spawning or permutation methods), they also satisfy LLFC in nearly all the layers. Furthermore, we delve deeper into the underlying factors contributing to LLFC, which reveal new insights into the spawning and permutation approaches. The study of LLFC transcends and advances our understanding of LMC by adopting a feature-learning perspective.  ( 2 min )
    Convolutional Monge Mapping Normalization for learning on sleep data. (arXiv:2305.18831v3 [eess.SP] UPDATED)
    In many machine learning applications on signals and biomedical data, especially electroencephalogram (EEG), one major challenge is the variability of the data across subjects, sessions, and hardware devices. In this work, we propose a new method called Convolutional Monge Mapping Normalization (CMMN), which consists in filtering the signals in order to adapt their power spectrum density (PSD) to a Wasserstein barycenter estimated on training data. CMMN relies on novel closed-form solutions for optimal transport mappings and barycenters and provides individual test time adaptation to new data without needing to retrain a prediction model. Numerical experiments on sleep EEG data show that CMMN leads to significant and consistent performance gains independent from the neural network architecture when adapting between subjects, sessions, and even datasets collected with different hardware. Notably our performance gain is on par with much more numerically intensive Domain Adaptation (DA) methods and can be used in conjunction with those for even better performances.  ( 2 min )
    CALLOC: Curriculum Adversarial Learning for Secure and Robust Indoor Localization. (arXiv:2311.06361v1 [cs.LG])
    Indoor localization has become increasingly vital for many applications from tracking assets to delivering personalized services. Yet, achieving pinpoint accuracy remains a challenge due to variations across indoor environments and devices used to assist with localization. Another emerging challenge is adversarial attacks on indoor localization systems that not only threaten service integrity but also reduce localization accuracy. To combat these challenges, we introduce CALLOC, a novel framework designed to resist adversarial attacks and variations across indoor environments and devices that reduce system accuracy and reliability. CALLOC employs a novel adaptive curriculum learning approach with a domain specific lightweight scaled-dot product attention neural network, tailored for adversarial and variation resilience in practical use cases with resource constrained mobile devices. Experimental evaluations demonstrate that CALLOC can achieve improvements of up to 6.03x in mean error and 4.6x in worst-case error against state-of-the-art indoor localization frameworks, across diverse building floorplans, mobile devices, and adversarial attacks scenarios.  ( 2 min )
    Reversible and irreversible bracket-based dynamics for deep graph neural networks. (arXiv:2305.15616v2 [cs.LG] UPDATED)
    Recent works have shown that physics-inspired architectures allow the training of deep graph neural networks (GNNs) without oversmoothing. The role of these physics is unclear, however, with successful examples of both reversible (e.g., Hamiltonian) and irreversible (e.g., diffusion) phenomena producing comparable results despite diametrically opposed mechanisms, and further complications arising due to empirical departures from mathematical theory. This work presents a series of novel GNN architectures based upon structure-preserving bracket-based dynamical systems, which are provably guaranteed to either conserve energy or generate positive dissipation with increasing depth. It is shown that the theoretically principled framework employed here allows for inherently explainable constructions, which contextualize departures from theory in current architectures and better elucidate the roles of reversibility and irreversibility in network performance.  ( 2 min )
    Quantum Data Compression and Quantum Cross Entropy. (arXiv:2106.13823v3 [quant-ph] UPDATED)
    The emerging field of quantum machine learning has the potential of revolutionizing our perspectives of quantum computing and artificial intelligence. In the predominantly empirical realm of quantum machine learning, a theoretical void persists. This paper addresses the gap by highlighting the quantum cross entropy, a pivotal counterpart to the classical cross entropy. We establish quantum cross entropy's role in quantum data compression, a fundamental machine learning task, by demonstrating that it acts as the compression rate for sub-optimal quantum source coding. Our approach involves a novel, universal quantum data compression protocol based on the quantum generalization of variable-length coding and the principle of quantum strong typicality. This reveals that quantum cross entropy can effectively serve as a loss function in quantum machine learning algorithms. Furthermore, we illustrate that the minimum of quantum cross entropy aligns with the von Neumann entropy, reinforcing its role as the optimal compression rate and underscoring its significance in advancing our understanding of quantum machine learning's theoretical framework.  ( 2 min )
    FastCLIPstyler: Optimisation-free Text-based Image Style Transfer Using Style Representations. (arXiv:2210.03461v3 [cs.CV] UPDATED)
    In recent years, language-driven artistic style transfer has emerged as a new type of style transfer technique, eliminating the need for a reference style image by using natural language descriptions of the style. The first model to achieve this, called CLIPstyler, has demonstrated impressive stylisation results. However, its lengthy optimisation procedure at runtime for each query limits its suitability for many practical applications. In this work, we present FastCLIPstyler, a generalised text-based image style transfer model capable of stylising images in a single forward pass for arbitrary text inputs. Furthermore, we introduce EdgeCLIPstyler, a lightweight model designed for compatibility with resource-constrained devices. Through quantitative and qualitative comparisons with state-of-the-art approaches, we demonstrate that our models achieve superior stylisation quality based on measurable metrics while offering significantly improved runtime efficiency, particularly on edge devices.  ( 2 min )
    Semantic segmentation of sparse irregular point clouds for leaf/wood discrimination. (arXiv:2305.16963v2 [cs.CV] UPDATED)
    LiDAR (Light Detection and Ranging) has become an essential part of the remote sensing toolbox used for biosphere monitoring. In particular, LiDAR provides the opportunity to map forest leaf area with unprecedented accuracy, while leaf area has remained an important source of uncertainty affecting models of gas exchanges between the vegetation and the atmosphere. Unmanned Aerial Vehicles (UAV) are easy to mobilize and therefore allow frequent revisits to track the response of vegetation to climate change. However, miniature sensors embarked on UAVs usually provide point clouds of limited density, which are further affected by a strong decrease in density from top to bottom of the canopy due to progressively stronger occlusion. In such a context, discriminating leaf points from wood points presents a significant challenge due in particular to strong class imbalance and spatially irregular sampling intensity. Here we introduce a neural network model based on the Pointnet ++ architecture which makes use of point geometry only (excluding any spectral information). To cope with local data sparsity, we propose an innovative sampling scheme which strives to preserve local important geometric information. We also propose a loss function adapted to the severe class imbalance. We show that our model outperforms state-of-the-art alternatives on UAV point clouds. We discuss future possible improvements, particularly regarding much denser point clouds acquired from below the canopy.  ( 3 min )
    Minimum Description Length Hopfield Networks. (arXiv:2311.06518v1 [cs.LG])
    Associative memory architectures are designed for memorization but also offer, through their retrieval method, a form of generalization to unseen inputs: stored memories can be seen as prototypes from this point of view. Focusing on Modern Hopfield Networks (MHN), we show that a large memorization capacity undermines the generalization opportunity. We offer a solution to better optimize this tradeoff. It relies on Minimum Description Length (MDL) to determine during training which memories to store, as well as how many of them.  ( 2 min )
    Phase Transitions of Civil Unrest across Countries and Time. (arXiv:2306.08698v3 [physics.soc-ph] UPDATED)
    Phase transitions, characterized by abrupt shifts between macroscopic patterns of organization, are ubiquitous in complex systems. Despite considerable research in the physical and natural sciences, the empirical study of this phenomenon in societal systems is relatively underdeveloped. The goal of this study is to explore whether the dynamics of collective civil unrest can be plausibly characterized as a sequence of recurrent phase shifts, with each phase having measurable and identifiable latent characteristics. Building on previous efforts to characterize civil unrest as a self-organized critical system, we introduce a macro-level statistical model of civil unrest and evaluate its plausibility using a comprehensive dataset of civil unrest events in 170 countries from 1946 to 2017. Our findings demonstrate that the macro-level phase model effectively captures the characteristics of civil unrest data from diverse countries globally and that universal mechanisms may underlie certain aspects of the dynamics of civil unrest. We also introduce a scale to quantify a country's long-term unrest per unit of time and show that civil unrest events tend to cluster geographically, with the magnitude of civil unrest concentrated in specific regions. Our approach has the potential to identify and measure phase transitions in various collective human phenomena beyond civil unrest, contributing to a better understanding of complex social systems.  ( 3 min )
    Fast Non-Rigid Radiance Fields from Monocularized Data. (arXiv:2212.01368v2 [cs.CV] UPDATED)
    The reconstruction and novel view synthesis of dynamic scenes recently gained increased attention. As reconstruction from large-scale multi-view data involves immense memory and computational requirements, recent benchmark datasets provide collections of single monocular views per timestamp sampled from multiple (virtual) cameras. We refer to this form of inputs as "monocularized" data. Existing work shows impressive results for synthetic setups and forward-facing real-world data, but is often limited in the training speed and angular range for generating novel views. This paper addresses these limitations and proposes a new method for full 360{\deg} inward-facing novel view synthesis of non-rigidly deforming scenes. At the core of our method are: 1) An efficient deformation module that decouples the processing of spatial and temporal information for accelerated training and inference; and 2) A static module representing the canonical scene as a fast hash-encoded neural radiance field. In addition to existing synthetic monocularized data, we systematically analyze the performance on real-world inward-facing scenes using a newly recorded challenging dataset sampled from a synchronized large-scale multi-view rig. In both cases, our method is significantly faster than previous methods, converging in less than 7 minutes and achieving real-time framerates at 1K resolution, while obtaining a higher visual accuracy for generated novel views. Our source code and data is available at our project page https://graphics.tu-bs.de/publications/kappel2022fast.  ( 3 min )
    ParticleWNN: a Novel Neural Networks Framework for Solving Partial Differential Equations. (arXiv:2305.12433v3 [cs.LG] UPDATED)
    Deep neural networks (DNNs) have been widely used to solve partial differential equations (PDEs) in recent years. In this work, a novel deep learning-based framework named Particle Weak-form based Neural Networks (ParticleWNN) is developed for solving PDEs in the weak form. In this framework, the trial space is defined as the space of DNNs, while the test space consists of functions compactly supported in extremely small regions, centered around particles. To facilitate the training of neural networks, an R-adaptive strategy is designed to adaptively modify the radius of regions during training. The ParticleWNN inherits the benefits of weak/variational formulation, requiring less regularity of the solution and a small number of quadrature points for computing integrals. Additionally, due to the special construction of the test functions, ParticleWNN enables parallel implementation and integral calculations only in extremely small regions. This framework is particularly desirable for solving problems with high-dimensional and complex domains. The efficiency and accuracy of ParticleWNN are demonstrated through several numerical examples, showcasing its superiority over state-of-the-art methods. The source code for the numerical examples presented in this paper is available at https://github.com/yaohua32/ParticleWNN.  ( 2 min )
    Interpreting Neural Policies with Disentangled Tree Representations. (arXiv:2210.06650v2 [cs.LG] UPDATED)
    The advancement of robots, particularly those functioning in complex human-centric environments, relies on control solutions that are driven by machine learning. Understanding how learning-based controllers make decisions is crucial since robots are often safety-critical systems. This urges a formal and quantitative understanding of the explanatory factors in the interpretability of robot learning. In this paper, we aim to study interpretability of compact neural policies through the lens of disentangled representation. We leverage decision trees to obtain factors of variation [1] for disentanglement in robot learning; these encapsulate skills, behaviors, or strategies toward solving tasks. To assess how well networks uncover the underlying task dynamics, we introduce interpretability metrics that measure disentanglement of learned neural dynamics from a concentration of decisions, mutual information and modularity perspective. We showcase the effectiveness of the connection between interpretability and disentanglement consistently across extensive experimental analysis.  ( 2 min )
    Plume: A Framework for High Performance Deep RL Network Controllers via Prioritized Trace Sampling. (arXiv:2302.12403v2 [cs.LG] UPDATED)
    Deep Reinforcement Learning (DRL) has shown promise in various networking environments. However, these environments present several fundamental challenges for standard DRL techniques. They are difficult to explore and exhibit high levels of noise and uncertainty. Although these challenges complicate the training process, we find that in practice we can substantially mitigate their effects and even achieve state-of-the-art real-world performance by addressing a factor that has been previously overlooked: the skewed input trace distribution in DRL training datasets. We introduce a generalized framework, Plume, to automatically identify and balance the skew using a three-stage process. First, we identify the critical features that determine the behavior of the traces. Second, we classify the traces into clusters. Finally, we prioritize the salient clusters to improve the overall performance of the controller. Plume seamlessly works across DRL algorithms, without requiring any changes to the DRL workflow. We evaluated Plume on three networking environments, including Adaptive Bitrate Streaming, Congestion Control, and Load Balancing. Plume offers superior performance in both simulation and real-world settings, across different controllers and DRL algorithms. For example, our novel ABR controller, Gelato trained with Plume consistently outperforms prior state-of-the-art controllers on the live streaming platform Puffer for over a year. It is the first controller on the platform to deliver statistically significant improvements in both video quality and stalling, decreasing stalls by as much as 75%.  ( 3 min )
    Learning Likelihoods with Conditional Normalizing Flows. (arXiv:1912.00042v2 [cs.LG] UPDATED)
    Normalizing Flows (NFs) are able to model complicated distributions p(y) with strong inter-dimensional correlations and high multimodality by transforming a simple base density p(z) through an invertible neural network under the change of variables formula. Such behavior is desirable in multivariate structured prediction tasks, where handcrafted per-pixel loss-based methods inadequately capture strong correlations between output dimensions. We present a study of conditional normalizing flows (CNFs), a class of NFs where the base density to output space mapping is conditioned on an input x, to model conditional densities p(y|x). CNFs are efficient in sampling and inference, they can be trained with a likelihood-based objective, and CNFs, being generative flows, do not suffer from mode collapse or training instabilities. We provide an effective method to train continuous CNFs for binary problems and in particular, we apply these CNFs to super-resolution and vessel segmentation tasks demonstrating competitive performance on standard benchmark datasets in terms of likelihood and conventional metrics.  ( 2 min )
    Testing of Deep Reinforcement Learning Agents with Surrogate Models. (arXiv:2305.12751v2 [cs.SE] UPDATED)
    Deep Reinforcement Learning (DRL) has received a lot of attention from the research community in recent years. As the technology moves away from game playing to practical contexts, such as autonomous vehicles and robotics, it is crucial to evaluate the quality of DRL agents. In this paper, we propose a search-based approach to test such agents. Our approach, implemented in a tool called Indago, trains a classifier on failure and non-failure environment (i.e., pass) configurations resulting from the DRL training process. The classifier is used at testing time as a surrogate model for the DRL agent execution in the environment, predicting the extent to which a given environment configuration induces a failure of the DRL agent under test. The failure prediction acts as a fitness function, guiding the generation towards failure environment configurations, while saving computation time by deferring the execution of the DRL agent in the environment to those configurations that are more likely to expose failures. Experimental results show that our search-based approach finds 50% more failures of the DRL agent than state-of-the-art techniques. Moreover, such failures are, on average, 78% more diverse; similarly, the behaviors of the DRL agent induced by failure configurations are 74% more diverse.  ( 2 min )
  • Open

    Embarassingly Simple Dataset Distillation. (arXiv:2311.07025v1 [cs.LG])
    Dataset distillation extracts a small set of synthetic training samples from a large dataset with the goal of achieving competitive performance on test data when trained on this sample. In this work, we tackle dataset distillation at its core by treating it directly as a bilevel optimization problem. Re-examining the foundational back-propagation through time method, we study the pronounced variance in the gradients, computational burden, and long-term dependencies. We introduce an improved method: Random Truncated Backpropagation Through Time (RaT-BPTT) to address them. RaT-BPTT incorporates a truncation coupled with a random window, effectively stabilizing the gradients and speeding up the optimization while covering long dependencies. This allows us to establish new state-of-the-art for a variety of standard dataset benchmarks. A deeper dive into the nature of distilled data unveils pronounced intercorrelation. In particular, subsets of distilled datasets tend to exhibit much worse performance than directly distilled smaller datasets of the same size. Leveraging RaT-BPTT, we devise a boosting mechanism that generates distilled datasets that contain subsets with near optimal performance across different data budgets.
    Faster Algorithms for Structured Linear and Kernel Support Vector Machines. (arXiv:2307.07735v2 [math.OC] UPDATED)
    Quadratic programming is a ubiquitous prototype in convex programming. Many combinatorial optimizations on graphs and machine learning problems can be formulated as quadratic programming; for example, Support Vector Machines (SVMs). Linear and kernel SVMs have been among the most popular models in machine learning over the past three decades, prior to the deep learning era. Generally, a quadratic program has an input size of $\Theta(n^2)$, where $n$ is the number of variables. Assuming the Strong Exponential Time Hypothesis ($\textsf{SETH}$), it is known that no $O(n^{2-o(1)})$ algorithm exists (Backurs, Indyk, and Schmidt, NIPS'17). However, problems such as SVMs usually feature much smaller input sizes: one is given $n$ data points, each of dimension $d$, with $d \ll n$. Furthermore, SVMs are variants with only $O(1)$ linear constraints. This suggests that faster algorithms are feasible, provided the program exhibits certain underlying structures. In this work, we design the first nearly-linear time algorithm for solving quadratic programs whenever the quadratic objective has small treewidth or admits a low-rank factorization, and the number of linear constraints is small. Consequently, we obtain a variety of results for SVMs: * For linear SVM, where the quadratic constraint matrix has treewidth $\tau$, we can solve the corresponding program in time $\widetilde O(n\tau^{(\omega+1)/2}\log(1/\epsilon))$; * For linear SVM, where the quadratic constraint matrix admits a low-rank factorization of rank-$k$, we can solve the corresponding program in time $\widetilde O(nk^{(\omega+1)/2}\log(1/\epsilon))$; * For Gaussian kernel SVM, where the data dimension $d = \Theta(\log n)$ and the squared dataset radius is small, we can solve it in time $O(n^{1+o(1)}\log(1/\epsilon))$. We also prove that when the squared dataset radius is large, then $\Omega(n^{2-o(1)})$ time is required.
    Explaining black boxes with a SMILE: Statistical Model-agnostic Interpretability with Local Explanations. (arXiv:2311.07286v1 [cs.LG])
    Machine learning is currently undergoing an explosion in capability, popularity, and sophistication. However, one of the major barriers to widespread acceptance of machine learning (ML) is trustworthiness: most ML models operate as black boxes, their inner workings opaque and mysterious, and it can be difficult to trust their conclusions without understanding how those conclusions are reached. Explainability is therefore a key aspect of improving trustworthiness: the ability to better understand, interpret, and anticipate the behaviour of ML models. To this end, we propose SMILE, a new method that builds on previous approaches by making use of statistical distance measures to improve explainability while remaining applicable to a wide range of input data domains.
    Learning Likelihoods with Conditional Normalizing Flows. (arXiv:1912.00042v2 [cs.LG] UPDATED)
    Normalizing Flows (NFs) are able to model complicated distributions p(y) with strong inter-dimensional correlations and high multimodality by transforming a simple base density p(z) through an invertible neural network under the change of variables formula. Such behavior is desirable in multivariate structured prediction tasks, where handcrafted per-pixel loss-based methods inadequately capture strong correlations between output dimensions. We present a study of conditional normalizing flows (CNFs), a class of NFs where the base density to output space mapping is conditioned on an input x, to model conditional densities p(y|x). CNFs are efficient in sampling and inference, they can be trained with a likelihood-based objective, and CNFs, being generative flows, do not suffer from mode collapse or training instabilities. We provide an effective method to train continuous CNFs for binary problems and in particular, we apply these CNFs to super-resolution and vessel segmentation tasks demonstrating competitive performance on standard benchmark datasets in terms of likelihood and conventional metrics.
    Physics-Informed Data Denoising for Real-Life Sensing Systems. (arXiv:2311.06968v1 [cs.LG])
    Sensors measuring real-life physical processes are ubiquitous in today's interconnected world. These sensors inherently bear noise that often adversely affects performance and reliability of the systems they support. Classic filtering-based approaches introduce strong assumptions on the time or frequency characteristics of sensory measurements, while learning-based denoising approaches typically rely on using ground truth clean data to train a denoising model, which is often challenging or prohibitive to obtain for many real-world applications. We observe that in many scenarios, the relationships between different sensor measurements (e.g., location and acceleration) are analytically described by laws of physics (e.g., second-order differential equation). By incorporating such physics constraints, we can guide the denoising process to improve even in the absence of ground truth data. In light of this, we design a physics-informed denoising model that leverages the inherent algebraic relationships between different measurements governed by the underlying physics. By obviating the need for ground truth clean data, our method offers a practical denoising solution for real-world applications. We conducted experiments in various domains, including inertial navigation, CO2 monitoring, and HVAC control, and achieved state-of-the-art performance compared with existing denoising methods. Our method can denoise data in real time (4ms for a sequence of 1s) for low-cost noisy sensors and produces results that closely align with those from high-precision, high-cost alternatives, leading to an efficient, cost-effective approach for more accurate sensor-based systems.
    Exactly Tight Information-Theoretic Generalization Error Bound for the Quadratic Gaussian Problem. (arXiv:2305.00876v2 [cs.IT] UPDATED)
    We provide a new information-theoretic generalization error bound that is exactly tight (i.e., matching even the constant) for the canonical quadratic Gaussian (location) problem. Most existing bounds are order-wise loose in this setting, which has raised concerns about the fundamental capability of information-theoretic bounds in reasoning the generalization behavior for machine learning. The proposed new bound adopts the individual-sample-based approach proposed by Bu et al., but also has several key new ingredients. Firstly, instead of applying the change of measure inequality on the loss function, we apply it to the generalization error function itself; secondly, the bound is derived in a conditional manner; lastly, a reference distribution is introduced. The combination of these components produces a KL-divergence-based generalization error bound. We show that although the latter two new ingredients can help make the bound exactly tight, removing them does not significantly degrade the bound, leading to an asymptotically tight mutual-information-based bound. We further consider the vector Gaussian setting, where a direct application of the proposed bound again does not lead to tight bounds except in special cases. A refined bound is then proposed for decomposable loss functions, leading to a tight bound for the vector setting.
    Machine learning for uncertainty estimation in fusing precipitation observations from satellites and ground-based gauges. (arXiv:2311.07511v1 [stat.ML])
    To form precipitation datasets that are accurate and, at the same time, have high spatial densities, data from satellites and gauges are often merged in the literature. However, uncertainty estimates for the data acquired in this manner are scarcely provided, although the importance of uncertainty quantification in predictive modelling is widely recognized. Furthermore, the benefits that machine learning can bring to the task of providing such estimates have not been broadly realized and properly explored through benchmark experiments. The present study aims at filling in this specific gap by conducting the first benchmark tests on the topic. On a large dataset that comprises 15-year-long monthly data spanning across the contiguous United States, we extensively compared six learners that are, by their construction, appropriate for predictive uncertainty quantification. These are the quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM) and quantile regression neural networks (QRNN). The comparison referred to the competence of the learners in issuing predictive quantiles at nine levels that facilitate a good approximation of the entire predictive probability distribution, and was primarily based on the quantile and continuous ranked probability skill scores. Three types of predictor variables (i.e., satellite precipitation variables, distances between a point of interest and satellite grid points, and elevation at a point of interest) were used in the comparison and were additionally compared with each other. This additional comparison was based on the explainable machine learning concept of feature importance. The results suggest that the order from the best to the worst of the learners for the task investigated is the following: LightGBM, QRF, GRF, GBM, QRNN and QR...
    Brownian Noise Reduction: Maximizing Privacy Subject to Accuracy Constraints. (arXiv:2206.07234v4 [cs.LG] UPDATED)
    There is a disconnect between how researchers and practitioners handle privacy-utility tradeoffs. Researchers primarily operate from a privacy first perspective, setting strict privacy requirements and minimizing risk subject to these constraints. Practitioners often desire an accuracy first perspective, possibly satisfied with the greatest privacy they can get subject to obtaining sufficiently small error. Ligett et al. have introduced a "noise reduction" algorithm to address the latter perspective. The authors show that by adding correlated Laplace noise and progressively reducing it on demand, it is possible to produce a sequence of increasingly accurate estimates of a private parameter while only paying a privacy cost for the least noisy iterate released. In this work, we generalize noise reduction to the setting of Gaussian noise, introducing the Brownian mechanism. The Brownian mechanism works by first adding Gaussian noise of high variance corresponding to the final point of a simulated Brownian motion. Then, at the practitioner's discretion, noise is gradually decreased by tracing back along the Brownian path to an earlier time. Our mechanism is more naturally applicable to the common setting of bounded $\ell_2$-sensitivity, empirically outperforms existing work on common statistical tasks, and provides customizable control of privacy loss over the entire interaction with the practitioner. We complement our Brownian mechanism with ReducedAboveThreshold, a generalization of the classical AboveThreshold algorithm that provides adaptive privacy guarantees. Overall, our results demonstrate that one can meet utility constraints while still maintaining strong levels of privacy.
    Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage. (arXiv:2302.02392v2 [cs.LG] UPDATED)
    In offline reinforcement learning (RL) we have no opportunity to explore so we must make assumptions that the data is sufficient to guide picking a good policy, taking the form of assuming some coverage, realizability, Bellman completeness, and/or hard margin (gap). In this work we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage of just a single comparator policy, and realizability of soft (entropy-regularized) Q-function of the single policy and a related function defined as a saddle point of certain minimax optimization problem. This offers refined and generally more lax conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms to accurately estimate soft or vanilla Q-functions with $L^2$-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
    Automatic Identification of Driving Maneuver Patterns using a Robust Hidden Semi-Markov Models. (arXiv:2311.07527v1 [stat.ML])
    There is an increase in interest to model driving maneuver patterns via the automatic unsupervised clustering of naturalistic sequential kinematic driving data. The patterns learned are often used in transportation research areas such as eco-driving, road safety, and intelligent vehicles. One such model capable of modeling these patterns is the Hierarchical Dirichlet Process Hidden Semi-Markov Model (HDP-HSMM), as it is often used to estimate data segmentation, state duration, and transition probabilities. While this model is a powerful tool for automatically clustering observed sequential data, the existing HDP-HSMM estimation suffers from an inherent tendency to overestimate the number of states. This can result in poor estimation, which can potentially impact impact transportation research through incorrect inference of driving patterns. In this paper, a new robust HDP-HSMM (rHDP-HSMM) method is proposed to reduce the number of redundant states and improve the consistency of the model's estimation. Both a simulation study and a case study using naturalistic driving data are presented to demonstrate the effectiveness of the proposed rHDP-HSMM in identifying and inference of driving maneuver patterns.
    Explicit Foundation Model Optimization with Self-Attentive Feed-Forward Neural Units. (arXiv:2311.07510v1 [cs.LG])
    Iterative approximation methods using backpropagation enable the optimization of neural networks, but they remain computationally expensive, especially when used at scale. This paper presents an efficient alternative for optimizing neural networks that reduces the costs of scaling neural networks and provides high-efficiency optimizations for low-resource applications. We will discuss a general result about feed-forward neural networks and then extend this solution to compositional (mult-layer) networks, which are applied to a simplified transformer block containing feed-forward and self-attention layers. These models are used to train highly-specified and complex multi-layer neural architectures that we refer to as self-attentive feed-forward unit (SAFFU) layers, which we use to develop a transformer that appears to generalize well over small, cognitively-feasible, volumes of data. Testing demonstrates explicit solutions outperform models optimized by backpropagation alone. Moreover, further application of backpropagation after explicit solutions leads to better optima from smaller scales of data, training effective models from much less data is enabled by explicit solution warm starts. We then carry out ablation experiments training a roadmap of about 250 transformer models over 1-million tokens to determine ideal settings. We find that multiple different architectural variants produce highly-performant models, and discover from this ablation that some of the best are not the most parameterized. This appears to indicate well-generalized models could be reached using less data by using explicit solutions, and that architectural exploration using explicit solutions pays dividends in guiding the search for efficient variants with fewer parameters, and which could be incorporated into low-resource hardware where AI might be embodied.
    Joint Group Invariant Functions on Data-Parameter Domain Induce Universal Neural Networks. (arXiv:2310.03530v2 [cs.LG] UPDATED)
    The symmetry and geometry of input data are considered to be encoded in the internal data representation inside the neural network, but the specific encoding rule has been less investigated. In this study, we present a systematic method to induce a generalized neural network and its right inverse operator, called the ridgelet transform, from a joint group invariant function on the data-parameter domain. Since the ridgelet transform is an inverse, (1) it can describe the arrangement of parameters for the network to represent a target function, which is understood as the encoding rule, and (2) it implies the universality of the network. Based on the group representation theory, we present a new simple proof of the universality by using Schur's lemma in a unified manner covering a wide class of networks, for example, the original ridgelet transform, formal deep networks, and the dual voice transform. Since traditional universality theorems were demonstrated based on functional analysis, this study sheds light on the group theoretic aspect of the approximation theory, connecting geometric deep learning to abstract harmonic analysis.  ( 2 min )
    On Dynamic Pricing with Covariates. (arXiv:2112.13254v3 [cs.LG] UPDATED)
    We consider dynamic pricing with covariates under a generalized linear demand model: a seller can dynamically adjust the price of a product over a horizon of $T$ time periods, and at each time period $t$, the demand of the product is jointly determined by the price and an observable covariate vector $x_t\in\mathbb{R}^d$ through a generalized linear model with unknown co-efficients. Most of the existing literature assumes the covariate vectors $x_t$'s are independently and identically distributed (i.i.d.); the few papers that relax this assumption either sacrifice model generality or yield sub-optimal regret bounds. In this paper, we show that UCB and Thompson sampling-based pricing algorithms can achieve an $O(d\sqrt{T}\log T)$ regret upper bound without assuming any statistical structure on the covariates $x_t$. Our upper bound on the regret matches the lower bound up to logarithmic factors. We thus show that (i) the i.i.d. assumption is not necessary for obtaining low regret, and (ii) the regret bound can be independent of the (inverse) minimum eigenvalue of the covariance matrix of the $x_t$'s, a quantity present in previous bounds. Moreover, we consider a constrained setting of the dynamic pricing problem where there is a limited and unreplenishable inventory and we develop theoretical results that relate the best achievable algorithm performance to a variation measure with respect to the temporal distribution shift of the covariates. We also discuss conditions under which a better regret is achievable and demonstrate the proposed algorithms' performance with numerical experiments.  ( 3 min )
    Deep Ridgelet Transform: Voice with Koopman Operator Proves Universality of Formal Deep Networks. (arXiv:2310.03529v2 [cs.LG] UPDATED)
    We identify hidden layers inside a deep neural network (DNN) with group actions on the data domain, and formulate a formal deep network as a dual voice transform with respect to the Koopman operator, a linear representation of the group action. Based on the group theoretic arguments, particularly by using Schur's lemma, we show a simple proof of the universality of DNNs.  ( 2 min )
    Reducing the Need for Backpropagation and Discovering Better Optima With Explicit Optimizations of Neural Networks. (arXiv:2311.07498v1 [cs.LG])
    Iterative differential approximation methods that rely upon backpropagation have enabled the optimization of neural networks; however, at present, they remain computationally expensive, especially when training models at scale. In this paper, we propose a computationally efficient alternative for optimizing neural networks that can both reduce the costs of scaling neural networks and provide high-efficiency optimizations for low-resource applications. We derive an explicit solution to a simple feed-forward language model (LM) by mathematically analyzing its gradients. This solution generalizes from single-layer LMs to the class of all single-layer feed-forward softmax-activated neural models trained on positive-valued features, as is demonstrated by our extension of this solution application to MNIST digit classification. For both LM and digit classifiers, we find computationally that explicit solutions perform near-optimality in experiments showing that 1) iterative optimization only marginally improves the explicit solution parameters and 2) randomly initialized parameters iteratively optimize towards the explicit solution. We also preliminarily apply the explicit solution locally by layer in multi-layer networks and discuss how the solution's computational savings increase with model complexity -- for both single- and mult-layer applications of the explicit solution, we emphasize that the optima achieved cannot be reached by backpropagation alone, i.e., better optima appear discoverable only after explicit solutions are applied. Finally, we discuss the solution's computational savings alongside its impact on model interpretability and suggest future directions for the derivation of explicit solutions to complex- and multi-layer architectures.  ( 3 min )
    Explainable Boosting Machines with Sparsity -- Maintaining Explainability in High-Dimensional Settings. (arXiv:2311.07452v1 [stat.ML])
    Compared to "black-box" models, like random forests and deep neural networks, explainable boosting machines (EBMs) are considered "glass-box" models that can be competitively accurate while also maintaining a higher degree of transparency and explainability. However, EBMs become readily less transparent and harder to interpret in high-dimensional settings with many predictor variables; they also become more difficult to use in production due to increases in scoring time. We propose a simple solution based on the least absolute shrinkage and selection operator (LASSO) that can help introduce sparsity by reweighting the individual model terms and removing the less relevant ones, thereby allowing these models to maintain their transparency and relatively fast scoring times in higher-dimensional settings. In short, post-processing a fitted EBM with many (i.e., possibly hundreds or thousands) of terms using the LASSO can help reduce the model's complexity and drastically improve scoring time. We illustrate the basic idea using two real-world examples with code.  ( 2 min )
    Sparsely Activated Networks. (arXiv:1907.06592v10 [cs.LG] UPDATED)
    Previous literature on unsupervised learning focused on designing structural priors with the aim of learning meaningful features. However, this was done without considering the description length of the learned representations which is a direct and unbiased measure of the model complexity. In this paper, first we introduce the $\varphi$ metric that evaluates unsupervised models based on their reconstruction accuracy and the degree of compression of their internal representations. We then present and define two activation functions (Identity, ReLU) as base of reference and three sparse activation functions (top-k absolutes, Extrema-Pool indices, Extrema) as candidate structures that minimize the previously defined $\varphi$. We lastly present Sparsely Activated Networks (SANs) that consist of kernels with shared weights that, during encoding, are convolved with the input and then passed through a sparse activation function. During decoding, the same weights are convolved with the sparse activation map and subsequently the partial reconstructions from each weight are summed to reconstruct the input. We compare SANs using the five previously defined activation functions on a variety of datasets (Physionet, UCI-epilepsy, MNIST, FMNIST) and show that models that are selected using $\varphi$ have small description representation length and consist of interpretable kernels.  ( 3 min )
    Mixed Semi-Supervised Generalized-Linear-Regression with applications to Deep-Learning and Interpolators. (arXiv:2302.09526v2 [stat.ME] UPDATED)
    We present a methodology for using unlabeled data to design semi supervised learning (SSL) methods that improve the prediction performance of supervised learning for regression tasks. The main idea is to design different mechanisms for integrating the unlabeled data, and include in each of them a mixing parameter $\alpha$, controlling the weight given to the unlabeled data. Focusing on Generalized Linear Models (GLM) and linear interpolators classes of models, we analyze the characteristics of different mixing mechanisms, and prove that in all cases, it is invariably beneficial to integrate the unlabeled data with some nonzero mixing ratio $\alpha>0$, in terms of predictive performance. Moreover, we provide a rigorous framework to estimate the best mixing ratio $\alpha^*$ where mixed SSL delivers the best predictive performance, while using the labeled and unlabeled data on hand. The effectiveness of our methodology in delivering substantial improvement compared to the standard supervised models, in a variety of settings, is demonstrated empirically through extensive simulation, in a manner that supports the theoretical analysis. We also demonstrate the applicability of our methodology (with some intuitive modifications) to improve more complex models, such as deep neural networks, in real-world regression tasks.  ( 2 min )
    FEMDA: a unified framework for discriminant analysis. (arXiv:2311.07518v1 [stat.ML])
    Although linear and quadratic discriminant analysis are widely recognized classical methods, they can encounter significant challenges when dealing with non-Gaussian distributions or contaminated datasets. This is primarily due to their reliance on the Gaussian assumption, which lacks robustness. We first explain and review the classical methods to address this limitation and then present a novel approach that overcomes these issues. In this new approach, the model considered is an arbitrary Elliptically Symmetrical (ES) distribution per cluster with its own arbitrary scale parameter. This flexible model allows for potentially diverse and independent samples that may not follow identical distributions. By deriving a new decision rule, we demonstrate that maximum-likelihood parameter estimation and classification are simple, efficient, and robust compared to state-of-the-art methods.
    Generating Personalized Insulin Treatments Strategies with Deep Conditional Generative Time Series Models. (arXiv:2309.16521v2 [stat.ML] UPDATED)
    We propose a novel framework that combines deep generative time series models with decision theory for generating personalized treatment strategies. It leverages historical patient trajectory data to jointly learn the generation of realistic personalized treatment and future outcome trajectories through deep generative time series models. In particular, our framework enables the generation of novel multivariate treatment strategies tailored to the personalized patient history and trained for optimal expected future outcomes based on conditional expected utility maximization. We demonstrate our framework by generating personalized insulin treatment strategies and blood glucose predictions for hospitalized diabetes patients, showcasing the potential of our approach for generating improved personalized treatment strategies. Keywords: deep generative model, probabilistic decision support, personalized treatment generation, insulin and blood glucose prediction
    Bias-inducing geometries: an exactly solvable data model with fairness implications. (arXiv:2205.15935v3 [cs.LG] UPDATED)
    Machine learning (ML) may be oblivious to human bias but it is not immune to its perpetuation. Marginalisation and iniquitous group representation are often traceable in the very data used for training, and may be reflected or even enhanced by the learning models. In the present work, we aim at clarifying the role played by data geometry in the emergence of ML bias. We introduce an exactly solvable high-dimensional model of data imbalance, where parametric control over the many bias-inducing factors allows for an extensive exploration of the bias inheritance mechanism. Through the tools of statistical physics, we analytically characterise the typical properties of learning models trained in this synthetic framework and obtain exact predictions for the observables that are commonly employed for fairness assessment. Despite the simplicity of the data model, we retrace and unpack typical unfairness behaviour observed on real-world datasets. We also obtain a detailed analytical characterisation of a class of bias mitigation strategies. We first consider a basic loss-reweighing scheme, which allows for an implicit minimisation of different unfairness metrics, and quantify the incompatibilities between some existing fairness criteria. Then, we consider a novel mitigation strategy based on a matched inference approach, consisting in the introduction of coupled learning models. Our theoretical analysis of this approach shows that the coupled strategy can strike superior fairness-accuracy trade-offs.  ( 3 min )
    ALPCAH: Sample-wise Heteroscedastic PCA with Tail Singular Value Regularization. (arXiv:2307.02745v2 [stat.ML] UPDATED)
    Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction that is useful for various data science problems. However, many applications involve heterogeneous data that varies in quality due to noise characteristics associated with different sources of the data. Methods that deal with this mixed dataset are known as heteroscedastic methods. Current methods like HePPCAT make Gaussian assumptions of the basis coefficients that may not hold in practice. Other methods such as Weighted PCA (WPCA) assume the noise variances are known, which may be difficult to know in practice. This paper develops a PCA method that can estimate the sample-wise noise variances and use this information in the model to improve the estimate of the subspace basis associated with the low-rank structure of the data. This is done without distributional assumptions of the low-rank component and without assuming the noise variances are known. Simulations show the effectiveness of accounting for such heteroscedasticity in the data, the benefits of using such a method with all of the data versus retaining only good data, and comparisons are made against other PCA methods established in the literature like PCA, Robust PCA (RPCA), and HePPCAT. Code available at https://github.com/javiersc1/ALPCAH  ( 3 min )
    Reformulating van Rijsbergen's $F_{\beta}$ metric for weighted binary cross-entropy. (arXiv:2210.16458v3 [stat.ML] UPDATED)
    The separation of performance metrics from gradient based loss functions may not always give optimal results and may miss vital aggregate information. This paper investigates incorporating a performance metric alongside differentiable loss functions to inform training outcomes. The goal is to guide model performance and interpretation by assuming statistical distributions on this performance metric for dynamic weighting. The focus is on van Rijsbergens $F_{\beta}$ metric -- a popular choice for gauging classification performance. Through distributional assumptions on the $F_{\beta}$, an intermediary link can be established to the standard binary cross-entropy via dynamic penalty weights. First, the $F_{\beta}$ metric is reformulated to facilitate assuming statistical distributions with accompanying proofs for the cumulative density function. These probabilities are used within a knee curve algorithm to find an optimal $\beta$ or $\beta_{opt}$. This $\beta_{opt}$ is used as a weight or penalty in the proposed weighted binary cross-entropy. Experimentation on publicly available data along with benchmark analysis mostly yields better and interpretable results as compared to the baseline for both imbalanced and balanced classes. For example, for the IMDB text data with known labeling errors, a 14% boost in $F_1$ score is shown. The results also reveal commonalities between the penalty model families derived in this paper and the suitability of recall-centric or precision-centric parameters used in the optimization. The flexibility of this methodology can enhance interpretation.  ( 2 min )
    Provably Convergent Plug-and-Play Quasi-Newton Methods. (arXiv:2303.07271v4 [math.OC] UPDATED)
    Plug-and-Play (PnP) methods are a class of efficient iterative methods that aim to combine data fidelity terms and deep denoisers using classical optimization algorithms, such as ISTA or ADMM, with applications in inverse problems and imaging. Provable PnP methods are a subclass of PnP methods with convergence guarantees, such as fixed point convergence or convergence to critical points of some energy function. Many existing provable PnP methods impose heavy restrictions on the denoiser or fidelity function, such as non-expansiveness or strict convexity, respectively. In this work, we propose a novel algorithmic approach incorporating quasi-Newton steps into a provable PnP framework based on proximal denoisers, resulting in greatly accelerated convergence while retaining light assumptions on the denoiser. By characterizing the denoiser as the proximal operator of a weakly convex function, we show that the fixed points of the proposed quasi-Newton PnP algorithm are critical points of a weakly convex function. Numerical experiments on image deblurring and super-resolution demonstrate 2--8x faster convergence as compared to other provable PnP methods with similar reconstruction quality.  ( 2 min )
    Non-approximability of constructive global $\mathcal{L}^2$ minimizers by gradient descent in Deep Learning. (arXiv:2311.07065v1 [cs.LG])
    We analyze geometric aspects of the gradient descent algorithm in Deep Learning (DL) networks. In particular, we prove that the globally minimizing weights and biases for the $\mathcal{L}^2$ cost obtained constructively in [Chen-Munoz Ewald 2023] for underparametrized ReLU DL networks can generically not be approximated via the gradient descent flow. We therefore conclude that the method introduced in [Chen-Munoz Ewald 2023] is disjoint from the gradient descent method.  ( 2 min )
    Towards the Fundamental Limits of Knowledge Transfer over Finite Domains. (arXiv:2310.07838v3 [cs.LG] UPDATED)
    We characterize the statistical efficiency of knowledge transfer through $n$ samples from a teacher to a probabilistic student classifier with input space $\mathcal S$ over labels $\mathcal A$. We show that privileged information at three progressive levels accelerates the transfer. At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate $\sqrt{{|{\mathcal S}||{\mathcal A}|}/{n}}$. The second level has the teacher probabilities of sampled labels available in addition, which turns out to boost the convergence rate lower bound to ${{|{\mathcal S}||{\mathcal A}|}/{n}}$. However, under this second data acquisition protocol, minimizing a naive adaptation of the cross-entropy loss results in an asymptotically biased student. We overcome this limitation and achieve the fundamental limit by using a novel empirical variant of the squared error logit loss. The third level further equips the student with the soft labels (complete logits) on ${\mathcal A}$ given every sampled input, thereby provably enables the student to enjoy a rate ${|{\mathcal S}|}/{n}$ free of $|{\mathcal A}|$. We find any Kullback-Leibler divergence minimizer to be optimal in the last case. Numerical simulations distinguish the four learners and corroborate our theory.  ( 3 min )
    Human-Aligned Calibration for AI-Assisted Decision Making. (arXiv:2306.00074v3 [cs.LG] UPDATED)
    Whenever a binary classifier is used to provide decision support, it typically provides both a label prediction and a confidence value. Then, the decision maker is supposed to use the confidence value to calibrate how much to trust the prediction. In this context, it has been often argued that the confidence value should correspond to a well calibrated estimate of the probability that the predicted label matches the ground truth label. However, multiple lines of empirical evidence suggest that decision makers have difficulties at developing a good sense on when to trust a prediction using these confidence values. In this paper, our goal is first to understand why and then investigate how to construct more useful confidence values. We first argue that, for a broad class of utility functions, there exist data distributions for which a rational decision maker is, in general, unlikely to discover the optimal decision policy using the above confidence values -- an optimal decision maker would need to sometimes place more (less) trust on predictions with lower (higher) confidence values. However, we then show that, if the confidence values satisfy a natural alignment property with respect to the decision maker's confidence on her own predictions, there always exists an optimal decision policy under which the level of trust the decision maker would need to place on predictions is monotone on the confidence values, facilitating its discoverability. Further, we show that multicalibration with respect to the decision maker's confidence on her own predictions is a sufficient condition for alignment. Experiments on four different AI-assisted decision making tasks where a classifier provides decision support to real human experts validate our theoretical results and suggest that alignment may lead to better decisions.  ( 3 min )
    FACT: High-Dimensional Random Forests Inference. (arXiv:2207.01678v2 [stat.ML] UPDATED)
    Quantifying the usefulness of individual features in random forests learning can greatly enhance its interpretability. Existing studies have shown that some popularly used feature importance measures for random forests suffer from the bias issue. In addition, there lack comprehensive size and power analyses for most of these existing methods. In this paper, we approach the problem via hypothesis testing, and suggest a framework of the self-normalized feature-residual correlation test (FACT) for evaluating the significance of a given feature in the random forests model with bias-resistance property, where our null hypothesis concerns whether the feature is conditionally independent of the response given all other features. Such an endeavor on random forests inference is empowered by some recent developments on high-dimensional random forests consistency. Under a fairly general high-dimensional nonparametric model setting with dependent features, we formally establish that FACT can provide theoretically justified feature importance test with controlled type I error and enjoy appealing power property. The theoretical results and finite-sample advantages of the newly suggested method are illustrated with several simulation examples and an economic forecasting application.  ( 2 min )
    Single-Call Stochastic Extragradient Methods for Structured Non-monotone Variational Inequalities: Improved Analysis under Weaker Conditions. (arXiv:2302.14043v2 [math.OC] UPDATED)
    Single-call stochastic extragradient methods, like stochastic past extragradient (SPEG) and stochastic optimistic gradient (SOG), have gained a lot of interest in recent years and are one of the most efficient algorithms for solving large-scale min-max optimization and variational inequalities problems (VIP) appearing in various machine learning tasks. However, despite their undoubted popularity, current convergence analyses of SPEG and SOG require a bounded variance assumption. In addition, several important questions regarding the convergence properties of these methods are still open, including mini-batching, efficient step-size selection, and convergence guarantees under different sampling strategies. In this work, we address these questions and provide convergence guarantees for two large classes of structured non-monotone VIPs: (i) quasi-strongly monotone problems (a generalization of strongly monotone problems) and (ii) weak Minty variational inequalities (a generalization of monotone and Minty VIPs). We introduce the expected residual condition, explain its benefits, and show how it can be used to obtain a strictly weaker bound than previously used growth conditions, expected co-coercivity, or bounded variance assumptions. Equipped with this condition, we provide theoretical guarantees for the convergence of single-call extragradient methods for different step-size selections, including constant, decreasing, and step-size-switching rules. Furthermore, our convergence analysis holds under the arbitrary sampling paradigm, which includes importance sampling and various mini-batching strategies as special cases.  ( 3 min )
    Tackling the Curse of Dimensionality with Physics-Informed Neural Networks. (arXiv:2307.12306v4 [cs.LG] UPDATED)
    The curse-of-dimensionality taxes computational resources heavily with exponentially increasing computational cost as the dimension increases. This poses great challenges in solving high-dimensional PDEs, as Richard E. Bellman first pointed out over 60 years ago. While there has been some recent success in solving numerically partial differential equations (PDEs) in high dimensions, such computations are prohibitively expensive, and true scaling of general nonlinear PDEs to high dimensions has never been achieved. We develop a new method of scaling up physics-informed neural networks (PINNs) to solve arbitrary high-dimensional PDEs. The new method, called Stochastic Dimension Gradient Descent (SDGD), decomposes a gradient of PDEs into pieces corresponding to different dimensions and randomly samples a subset of these dimensional pieces in each iteration of training PINNs. We prove theoretically the convergence and other desired properties of the proposed method. We demonstrate in various diverse tests that the proposed method can solve many notoriously hard high-dimensional PDEs, including the Hamilton-Jacobi-Bellman (HJB) and the Schr\"{o}dinger equations in tens of thousands of dimensions very fast on a single GPU using the PINNs mesh-free approach. Notably, we solve nonlinear PDEs with nontrivial, anisotropic, and inseparable solutions in 100,000 effective dimensions in 12 hours on a single GPU using SDGD with PINNs. Since SDGD is a general training methodology of PINNs, it can be applied to any current and future variants of PINNs to scale them up for arbitrary high-dimensional PDEs.  ( 3 min )
    A Large Deviations Perspective on Policy Gradient Algorithms. (arXiv:2311.07411v1 [math.OC])
    We derive the first large deviation rate function for the stochastic iterates generated by policy gradient methods with a softmax parametrization and an entropy regularized objective. Leveraging the contraction principle from large deviations theory, we also develop a general recipe for deriving exponential convergence rates for a wide spectrum of other policy parametrizations. This approach unifies several results from the literature and simplifies existing proof techniques.  ( 2 min )
    On Binscatter. (arXiv:1902.09608v4 [econ.EM] UPDATED)
    Binscatter is a popular method for visualizing bivariate relationships and conducting informal specification testing. We study the properties of this method formally and develop enhanced visualization and econometric binscatter tools. These include estimating conditional means with optimal binning and quantifying uncertainty. We also highlight a methodological problem related to covariate adjustment that can yield incorrect conclusions. We revisit two applications using our methodology and find substantially different results relative to those obtained using prior informal binscatter methods. General purpose software in Python, R, and Stata is provided. Our technical work is of independent interest for the nonparametric partition-based estimation literature.  ( 2 min )
    How do Minimum-Norm Shallow Denoisers Look in Function Space?. (arXiv:2311.06748v1 [stat.ML])
    Neural network (NN) denoisers are an essential building block in many common tasks, ranging from image reconstruction to image generation. However, the success of these models is not well understood from a theoretical perspective. In this paper, we aim to characterize the functions realized by shallow ReLU NN denoisers -- in the common theoretical setting of interpolation (i.e., zero training loss) with a minimal representation cost (i.e., minimal $\ell^2$ norm weights). First, for univariate data, we derive a closed form for the NN denoiser function, find it is contractive toward the clean data points, and prove it generalizes better than the empirical MMSE estimator at a low noise level. Next, for multivariate data, we find the NN denoiser functions in a closed form under various geometric assumptions on the training data: data contained in a low-dimensional subspace, data contained in a union of one-sided rays, or several types of simplexes. These functions decompose into a sum of simple rank-one piecewise linear interpolations aligned with edges and/or faces connecting training samples. We empirically verify this alignment phenomenon on synthetic data and real images.  ( 2 min )
    The Exact Determinant of a Specific Class of Sparse Positive Definite Matrices. (arXiv:2311.06632v1 [stat.ML])
    For a specific class of sparse Gaussian graphical models, we provide a closed-form solution for the determinant of the covariance matrix. In our framework, the graphical interaction model (i.e., the covariance selection model) is equal to replacement product of $\mathcal{K}_{n}$ and $\mathcal{K}_{n-1}$, where $\mathcal{K}_n$ is the complete graph with $n$ vertices. Our analysis is based on taking the Fourier transform of the local factors of the model, which can be viewed as an application of the Normal Factor Graph Duality Theorem and holographic algorithms. The closed-form expression is obtained by applying the Matrix Determinant Lemma on the transformed graphical model. In this context, we will also define a notion of equivalence between two Gaussian graphical models.  ( 2 min )
    Agnostic Membership Query Learning with Nontrivial Savings: New Results, Techniques. (arXiv:2311.06690v1 [cs.LG])
    (Abridged) Designing computationally efficient algorithms in the agnostic learning model (Haussler, 1992; Kearns et al., 1994) is notoriously difficult. In this work, we consider agnostic learning with membership queries for touchstone classes at the frontier of agnostic learning, with a focus on how much computation can be saved over the trivial runtime of 2^n$. This approach is inspired by and continues the study of ``learning with nontrivial savings'' (Servedio and Tan, 2017). To this end, we establish multiple agnostic learning algorithms, highlighted by: 1. An agnostic learning algorithm for circuits consisting of a sublinear number of gates, which can each be any function computable by a sublogarithmic degree k polynomial threshold function (the depth of the circuit is bounded only by size). This algorithm runs in time 2^{n -s(n)} for s(n) \approx n/(k+1), and learns over the uniform distribution over unlabelled examples on \{0,1\}^n. 2. An agnostic learning algorithm for circuits consisting of a sublinear number of gates, where each can be any function computable by a \sym^+ circuit of subexponential size and sublogarithmic degree k. This algorithm runs in time 2^{n-s(n)} for s(n) \approx n/(k+1), and learns over distributions of unlabelled examples that are products of k+1 arbitrary and unknown distributions, each over \{0,1\}^{n/(k+1)} (assume without loss of generality that k+1 divides n).  ( 2 min )
    Anchor Data Augmentation. (arXiv:2311.06965v1 [cs.LG])
    We propose a novel algorithm for data augmentation in nonlinear over-parametrized regression. Our data augmentation algorithm borrows from the literature on causality and extends the recently proposed Anchor regression (AR) method for data augmentation, which is in contrast to the current state-of-the-art domain-agnostic solutions that rely on the Mixup literature. Our Anchor Data Augmentation (ADA) uses several replicas of the modified samples in AR to provide more training examples, leading to more robust regression predictions. We apply ADA to linear and nonlinear regression problems using neural networks. ADA is competitive with state-of-the-art C-Mixup solutions.  ( 2 min )
    A PAC-Bayesian Perspective on the Interpolating Information Criterion. (arXiv:2311.07013v1 [stat.ML])
    Deep learning is renowned for its theory-practice gap, whereby principled theory typically fails to provide much beneficial guidance for implementation in practice. This has been highlighted recently by the benign overfitting phenomenon: when neural networks become sufficiently large to interpolate the dataset perfectly, model performance appears to improve with increasing model size, in apparent contradiction with the well-known bias-variance tradeoff. While such phenomena have proven challenging to theoretically study for general models, the recently proposed Interpolating Information Criterion (IIC) provides a valuable theoretical framework to examine performance for overparameterized models. Using the IIC, a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence generalization performance in the interpolating regime. From the provided bound, we quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by e.g. the combination of model, optimizer, and parameter-initialization scheme; the spectrum of the empirical neural tangent kernel; curvature of the loss landscape; and noise present in the data.  ( 2 min )
    Exploration via linearly perturbed loss minimisation. (arXiv:2311.07565v1 [cs.LG])
    We introduce exploration via linear loss perturbations (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. With the data-dependent perturbations we propose, not present in previous PHE-type methods, EVILL is shown to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside of generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.  ( 2 min )
    Learning nonparametric ordinary differential equations from noisy data. (arXiv:2206.15215v3 [stat.ML] UPDATED)
    Learning nonparametric systems of Ordinary Differential Equations (ODEs) dot x = f(t,x) from noisy data is an emerging machine learning topic. We use the well-developed theory of Reproducing Kernel Hilbert Spaces (RKHS) to define candidates for f for which the solution of the ODE exists and is unique. Learning f consists of solving a constrained optimization problem in an RKHS. We propose a penalty method that iteratively uses the Representer theorem and Euler approximations to provide a numerical solution. We prove a generalization bound for the L2 distance between x and its estimator and provide experimental comparisons with the state-of-the-art.  ( 2 min )
    Modulated Neural ODEs. (arXiv:2302.13262v3 [cs.LG] UPDATED)
    Neural ordinary differential equations (NODEs) have been proven useful for learning non-linear dynamics of arbitrary trajectories. However, current NODE methods capture variations across trajectories only via the initial state value or by auto-regressive encoder updates. In this work, we introduce Modulated Neural ODEs (MoNODEs), a novel framework that sets apart dynamics states from underlying static factors of variation and improves the existing NODE methods. In particular, we introduce $\textit{time-invariant modulator variables}$ that are learned from the data. We incorporate our proposed framework into four existing NODE variants. We test MoNODE on oscillating systems, videos and human walking trajectories, where each trajectory has trajectory-specific modulation. Our framework consistently improves the existing model ability to generalize to new dynamic parameterizations and to perform far-horizon forecasting. In addition, we verify that the proposed modulator variables are informative of the true unknown factors of variation as measured by $R^2$ scores.  ( 2 min )
    Asymptotic in a class of network models with an increasing sub-Gamma degree sequence. (arXiv:2111.01301v4 [math.ST] UPDATED)
    For the differential privacy under the sub-Gamma noise, we derive the asymptotic properties of a class of network models with binary values with a general link function. In this paper, we release the degree sequences of the binary networks under a general noisy mechanism with the discrete Laplace mechanism as a special case. We establish the asymptotic result including both consistency and asymptotically normality of the parameter estimator when the number of parameters goes to infinity in a class of network models. Simulations and a real data example are provided to illustrate asymptotic results.  ( 2 min )
    Instance-Dependent Generalization Bounds via Optimal Transport. (arXiv:2211.01258v4 [stat.ML] UPDATED)
    Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks. Since such bounds often hold uniformly over all parameters, they suffer from over-parametrization and fail to account for the strong inductive bias of initialization and stochastic gradient descent. As an alternative, we propose a novel optimal transport interpretation of the generalization problem. This allows us to derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. Therefore, our bounds are agnostic to the parametrization of the model and work well when the number of training samples is much smaller than the number of parameters. With small modifications, our approach yields accelerated rates for data on low-dimensional manifolds and guarantees under distribution shifts. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.  ( 2 min )
    Verifiable Learning for Robust Tree Ensembles. (arXiv:2305.03626v4 [cs.LG] UPDATED)
    Verifying the robustness of machine learning models against evasion attacks at test time is an important research problem. Unfortunately, prior work established that this problem is NP-hard for decision tree ensembles, hence bound to be intractable for specific inputs. In this paper, we identify a restricted class of decision tree ensembles, called large-spread ensembles, which admit a security verification algorithm running in polynomial time. We then propose a new approach called verifiable learning, which advocates the training of such restricted model classes which are amenable for efficient verification. We show the benefits of this idea by designing a new training algorithm that automatically learns a large-spread decision tree ensemble from labelled data, thus enabling its security verification in polynomial time. Experimental results on public datasets confirm that large-spread ensembles trained using our algorithm can be verified in a matter of seconds, using standard commercial hardware. Moreover, large-spread ensembles are more robust than traditional ensembles against evasion attacks, at the cost of an acceptable loss of accuracy in the non-adversarial setting.  ( 2 min )
    Sparse Density Trees and Lists: An Interpretable Alternative to High-Dimensional Histograms. (arXiv:1510.06779v4 [stat.ML] UPDATED)
    We present sparse tree-based and list-based density estimation methods for binary/categorical data. Our density estimation models are higher dimensional analogies to variable bin width histograms. In each leaf of the tree (or list), the density is constant, similar to the flat density within the bin of a histogram. Histograms, however, cannot easily be visualized in more than two dimensions, whereas our models can. The accuracy of histograms fades as dimensions increase, whereas our models have priors that help with generalization. Our models are sparse, unlike high-dimensional fixed-bin histograms. We present three generative modeling methods, where the first one allows the user to specify the preferred number of leaves in the tree within a Bayesian prior. The second method allows the user to specify the preferred number of branches within the prior. The third method returns density lists (rather than trees) and allows the user to specify the preferred number of rules and the length of rules within the prior. The new approaches often yield a better balance between sparsity and accuracy of density estimates than other methods for this task. We present an application to crime analysis, where we estimate how unusual each type of modus operandi is for a house break-in.  ( 3 min )
    Anytime Model Selection in Linear Bandits. (arXiv:2307.12897v2 [stat.ML] UPDATED)
    Model selection in the context of bandit optimization is a challenging problem, as it requires balancing exploration and exploitation not only for action selection, but also for model selection. One natural approach is to rely on online learning algorithms that treat different models as experts. Existing methods, however, scale poorly ($\text{poly}M$) with the number of models $M$ in terms of their regret. Our key insight is that, for model selection in linear bandits, we can emulate full-information feedback to the online learner with a favorable bias-variance trade-off. This allows us to develop ALEXP, which has an exponentially improved ($\log M$) dependence on $M$ for its regret. ALEXP has anytime guarantees on its regret, and neither requires knowledge of the horizon $n$, nor relies on an initial purely exploratory stage. Our approach utilizes a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.  ( 2 min )
    ODTlearn: A Package for Learning Optimal Decision Trees for Prediction and Prescription. (arXiv:2307.15691v2 [stat.ML] UPDATED)
    ODTLearn is an open-source Python package that provides methods for learning optimal decision trees for high-stakes predictive and prescriptive tasks based on the mixed-integer optimization (MIO) framework proposed in Aghaei et al. (2019) and several of its extensions. The current version of the package provides implementations for learning optimal classification trees, optimal fair classification trees, optimal classification trees robust to distribution shifts, and optimal prescriptive trees from observational data. We have designed the package to be easy to maintain and extend as new optimal decision tree problem classes, reformulation strategies, and solution algorithms are introduced. To this end, the package follows object-oriented design principles and supports both commercial (Gurobi) and open source (COIN-OR branch and cut) solvers. The package documentation and an extensive user guide can be found at https://d3m-research-group.github.io/odtlearn/. Additionally, users can view the package source code and submit feature requests and bug reports by visiting https://github.com/D3M-Research-Group/odtlearn.  ( 2 min )
    A Federated Data Fusion-Based Prognostic Model for Applications with Multi-Stream Incomplete Signals. (arXiv:2311.07474v1 [stat.ML])
    Most prognostic methods require a decent amount of data for model training. In reality, however, the amount of historical data owned by a single organization might be small or not large enough to train a reliable prognostic model. To address this challenge, this article proposes a federated prognostic model that allows multiple users to jointly construct a failure time prediction model using their multi-stream, high-dimensional, and incomplete data while keeping each user's data local and confidential. The prognostic model first employs multivariate functional principal component analysis to fuse the multi-stream degradation signals. Then, the fused features coupled with the times-to-failure are utilized to build a (log)-location-scale regression model for failure prediction. To estimate parameters using distributed datasets and keep the data privacy of all participants, we propose a new federated algorithm for feature extraction. Numerical studies indicate that the performance of the proposed model is the same as that of classic non-federated prognostic models and is better than that of the models constructed by each user itself.  ( 2 min )
    Computerized Tomography and Reproducing Kernels. (arXiv:2311.07465v1 [math.FA])
    The X-ray transform is one of the most fundamental integral operators in image processing and reconstruction. In this article, we revisit its mathematical formalism, and propose an innovative approach making use of Reproducing Kernel Hilbert Spaces (RKHS). Within this framework, the X-ray transform can be considered as a natural analogue of Euclidean projections. The RKHS framework considerably simplifies projection image interpolation, and leads to an analogue of the celebrated representer theorem for the problem of tomographic reconstruction. It leads to methodology that is dimension-free and stands apart from conventional filtered back-projection techniques, as it does not hinge on the Fourier transform. It also allows us to establish sharp stability results at a genuinely functional level, but in the realistic setting where the data are discrete and noisy. The RKHS framework is amenable to any reproducing kernel on a unit ball, affording a high level of generality. When the kernel is chosen to be rotation-invariant, one can obtain explicit spectral representations which elucidate the regularity structure of the associated Hilbert spaces, and one can also solve the reconstruction problem at the same computational cost as filtered back-projection.  ( 2 min )
    Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant. (arXiv:2306.09338v3 [cs.LG] UPDATED)
    This article provides a comprehensive understanding of optimization in deep learning, with a primary focus on the challenges of gradient vanishing and gradient exploding, which normally lead to diminished model representational ability and training instability, respectively. We analyze these two challenges through several strategic measures, including the improvement of gradient flow and the imposition of constraints on a network's Lipschitz constant. To help understand the current optimization methodologies, we categorize them into two classes: explicit optimization and implicit optimization. Explicit optimization methods involve direct manipulation of optimizer parameters, including weight, gradient, learning rate, and weight decay. Implicit optimization methods, by contrast, focus on improving the overall landscape of a network by enhancing its modules, such as residual shortcuts, normalization methods, attention mechanisms, and activations. In this article, we provide an in-depth analysis of these two optimization classes and undertake a thorough examination of the Jacobian matrices and the Lipschitz constants of many widely used deep learning modules, highlighting existing issues as well as potential improvements. Moreover, we also conduct a series of analytical experiments to substantiate our theoretical discussions. This article does not aim to propose a new optimizer or network. Rather, our intention is to present a comprehensive understanding of optimization in deep learning. We hope that this article will assist readers in gaining a deeper insight in this field and encourages the development of more robust, efficient, and high-performing models.  ( 3 min )
    arfpy: A python package for density estimation and generative modeling with adversarial random forests. (arXiv:2311.07366v1 [stat.ML])
    This paper introduces $\textit{arfpy}$, a python implementation of Adversarial Random Forests (ARF) (Watson et al., 2023), which is a lightweight procedure for synthesizing new data that resembles some given data. The software $\textit{arfpy}$ equips practitioners with straightforward functionalities for both density estimation and generative modeling. The method is particularly useful for tabular data and its competitive performance is demonstrated in previous literature. As a major advantage over the mostly deep learning based alternatives, $\textit{arfpy}$ combines the method's reduced requirements in tuning efforts and computational resources with a user-friendly python interface. This supplies audiences across scientific fields with software to generate data effortlessly.  ( 2 min )
    Towards Bounding Causal Effects under Markov Equivalence. (arXiv:2311.07259v1 [stat.ML])
    Predicting the effect of unseen interventions is a fundamental research question across the data sciences. It is well established that, in general, such questions cannot be answered definitively from observational data, e.g., as a consequence of unobserved confounding. A generalization of this task is to determine non-trivial bounds on causal effects induced by the data, also known as the task of partial causal identification. In the literature, several algorithms have been developed for solving this problem. Most, however, require a known parametric form or a fully specified causal diagram as input, which is usually not available in practical applications. In this paper, we assume as input a less informative structure known as a Partial Ancestral Graph, which represents a Markov equivalence class of causal diagrams and is learnable from observational data. In this more "data-driven" setting, we provide a systematic algorithm to derive bounds on causal effects that can be computed analytically.  ( 2 min )
    Estimating optical vegetation indices with Sentinel-1 SAR data and AutoML. (arXiv:2311.07537v1 [stat.ML])
    Current optical vegetation indices (VIs) for monitoring forest ecosystems are widely used in various applications. However, continuous monitoring based on optical satellite data can be hampered by atmospheric effects such as clouds. On the contrary, synthetic aperture radar (SAR) data can offer insightful and systematic forest monitoring with complete time series due to signal penetration through clouds and day and night acquisitions. The goal of this work is to overcome the issues affecting optical data with SAR data and serve as a substitute for estimating optical VIs for forests using machine learning. Time series of four VIs (LAI, FAPAR, EVI and NDVI) were estimated using multitemporal Sentinel-1 SAR and ancillary data. This was enabled by creating a paired multi-temporal and multi-modal dataset in Google Earth Engine (GEE), including temporally and spatially aligned Sentinel-1, Sentinel-2, digital elevation model (DEM), weather and land cover datasets (MMT-GEE). The use of ancillary features generated from DEM and weather data improved the results. The open-source Automatic Machine Learning (AutoML) approach, auto-sklearn, outperformed Random Forest Regression for three out of four VIs, while a 1-hour optimization length was enough to achieve sufficient results with an R2 of 69-84% low errors (0.05-0.32 of MAE depending on VI). Great agreement was also found for selected case studies in the time series analysis and in the spatial comparison between the original and estimated SAR-based VIs. In general, compared to VIs from currently freely available optical satellite data and available global VI products, a better temporal resolution (up to 240 measurements/year) and a better spatial resolution (20 m) were achieved using estimated SAR-based VIs. A great advantage of the SAR-based VI is the ability to detect abrupt forest changes with a sub-weekly temporal accuracy.  ( 3 min )
    Slowly Varying Regression under Sparsity. (arXiv:2102.10773v5 [cs.LG] UPDATED)
    We present the framework of slowly varying regression under sparsity, allowing sparse regression models to exhibit slow and sparse variations. The problem of parameter estimation is formulated as a mixed-integer optimization problem. We demonstrate that it can be precisely reformulated as a binary convex optimization problem through a novel relaxation technique. This relaxation involves a new equality on Moore-Penrose inverses, convexifying the non-convex objective function while matching the original objective on all feasible binary points. This enables us to efficiently solve the problem to provable optimality using a cutting plane-type algorithm. We develop a highly optimized implementation of this algorithm, substantially improving upon the asymptotic computational complexity of a straightforward implementation. Additionally, we propose a fast heuristic method that guarantees a feasible solution and, as empirically illustrated, produces high-quality warm-start solutions for the binary optimization problem. To tune the framework's hyperparameters, we suggest a practical procedure relying on binary search that, under certain assumptions, is guaranteed to recover the true model parameters. On both synthetic and real-world datasets, we demonstrate that the resulting algorithm outperforms competing formulations in comparable times across various metrics, including estimation accuracy, predictive power, and computational time. The algorithm is highly scalable, allowing us to train models with thousands of parameters. Our implementation is available open-source at https://github.com/vvdigalakis/SSVRegression.git.  ( 3 min )
    Interpreting Neural Policies with Disentangled Tree Representations. (arXiv:2210.06650v2 [cs.LG] UPDATED)
    The advancement of robots, particularly those functioning in complex human-centric environments, relies on control solutions that are driven by machine learning. Understanding how learning-based controllers make decisions is crucial since robots are often safety-critical systems. This urges a formal and quantitative understanding of the explanatory factors in the interpretability of robot learning. In this paper, we aim to study interpretability of compact neural policies through the lens of disentangled representation. We leverage decision trees to obtain factors of variation [1] for disentanglement in robot learning; these encapsulate skills, behaviors, or strategies toward solving tasks. To assess how well networks uncover the underlying task dynamics, we introduce interpretability metrics that measure disentanglement of learned neural dynamics from a concentration of decisions, mutual information and modularity perspective. We showcase the effectiveness of the connection between interpretability and disentanglement consistently across extensive experimental analysis.  ( 2 min )
    Simulator-Based Inference with Waldo: Confidence Regions by Leveraging Prediction Algorithms and Posterior Estimators for Inverse Problems. (arXiv:2205.15680v4 [stat.ML] UPDATED)
    Prediction algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method to construct confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led to estimates with prediction bias. We also illustrate how our approach can correct overly confident posterior regions computed with normalizing flows.  ( 3 min )
    Data-driven rules for multidimensional reflection problems. (arXiv:2311.06639v1 [math.OC])
    Over the recent past data-driven algorithms for solving stochastic optimal control problems in face of model uncertainty have become an increasingly active area of research. However, for singular controls and underlying diffusion dynamics the analysis has so far been restricted to the scalar case. In this paper we fill this gap by studying a multivariate singular control problem for reversible diffusions with controls of reflection type. Our contributions are threefold. We first explicitly determine the long-run average costs as a domain-dependent functional, showing that the control problem can be equivalently characterized as a shape optimization problem. For given diffusion dynamics, assuming the optimal domain to be strongly star-shaped, we then propose a gradient descent algorithm based on polytope approximations to numerically determine a cost-minimizing domain. Finally, we investigate data-driven solutions when the diffusion dynamics are unknown to the controller. Using techniques from nonparametric statistics for stochastic processes, we construct an optimal domain estimator, whose static regret is bounded by the minimax optimal estimation rate of the unreflected process' invariant density. In the most challenging situation, when the dynamics must be learned simultaneously to controlling the process, we develop an episodic learning algorithm to overcome the emerging exploration-exploitation dilemma and show that given the static regret as a baseline, the loss in its sublinear regret per time unit is of natural order compared to the one-dimensional case.  ( 2 min )
    RankSEG: A Consistent Ranking-based Framework for Segmentation. (arXiv:2206.13086v3 [stat.ML] UPDATED)
    Segmentation has emerged as a fundamental field of computer vision and natural language processing, which assigns a label to every pixel/feature to extract regions of interest from an image/text. To evaluate the performance of segmentation, the Dice and IoU metrics are used to measure the degree of overlap between the ground truth and the predicted segmentation. In this paper, we establish a theoretical foundation of segmentation with respect to the Dice/IoU metrics, including the Bayes rule and Dice-/IoU-calibration, analogous to classification-calibration or Fisher consistency in classification. We prove that the existing thresholding-based framework with most operating losses are not consistent with respect to the Dice/IoU metrics, and thus may lead to a suboptimal solution. To address this pitfall, we propose a novel consistent ranking-based framework, namely RankDice/RankIoU, inspired by plug-in rules of the Bayes segmentation rule. Three numerical algorithms with GPU parallel execution are developed to implement the proposed framework in large-scale and high-dimensional segmentation. We study statistical properties of the proposed framework. We show it is Dice-/IoU-calibrated, and its excess risk bounds and the rate of convergence are also provided. The numerical effectiveness of RankDice/mRankDice is demonstrated in various simulated examples and Fine-annotated CityScapes, Pascal VOC and Kvasir-SEG datasets with state-of-the-art deep learning architectures.  ( 3 min )
    Augmented Bridge Matching. (arXiv:2311.06978v1 [cs.LG])
    Flow and bridge matching are a novel class of processes which encompass diffusion models. One of the main aspect of their increased flexibility is that these models can interpolate between arbitrary data distributions i.e. they generalize beyond generative modeling and can be applied to learning stochastic (and deterministic) processes of arbitrary transfer tasks between two given distributions. In this paper, we highlight that while flow and bridge matching processes preserve the information of the marginal distributions, they do \emph{not} necessarily preserve the coupling information unless additional, stronger optimality conditions are met. This can be problematic if one aims at preserving the original empirical pairing. We show that a simple modification of the matching process recovers this coupling by augmenting the velocity field (or drift) with the information of the initial sample point. Doing so, we lose the Markovian property of the process but preserve the coupling information between distributions. We illustrate the efficiency of our augmentation in learning mixture of image translation tasks.  ( 2 min )
    A statistical perspective on algorithm unrolling models for inverse problems. (arXiv:2311.06395v1 [stat.ML])
    We consider inverse problems where the conditional distribution of the observation ${\bf y}$ given the latent variable of interest ${\bf x}$ (also known as the forward model) is known, and we have access to a data set in which multiple instances of ${\bf x}$ and ${\bf y}$ are both observed. In this context, algorithm unrolling has become a very popular approach for designing state-of-the-art deep neural network architectures that effectively exploit the forward model. We analyze the statistical complexity of the gradient descent network (GDN), an algorithm unrolling architecture driven by proximal gradient descent. We show that the unrolling depth needed for the optimal statistical performance of GDNs is of order $\log(n)/\log(\varrho_n^{-1})$, where $n$ is the sample size, and $\varrho_n$ is the convergence rate of the corresponding gradient descent algorithm. We also show that when the negative log-density of the latent variable ${\bf x}$ has a simple proximal operator, then a GDN unrolled at depth $D'$ can solve the inverse problem at the parametric rate $O(D'/\sqrt{n})$. Our results thus also suggest that algorithm unrolling models are prone to overfitting as the unrolling depth $D'$ increases. We provide several examples to illustrate these results.  ( 2 min )
    Online multiple testing with e-values. (arXiv:2311.06412v1 [stat.ME])
    A scientist tests a continuous stream of hypotheses over time in the course of her investigation -- she does not test a predetermined, fixed number of hypotheses. The scientist wishes to make as many discoveries as possible while ensuring the number of false discoveries is controlled -- a well recognized way for accomplishing this is to control the false discovery rate (FDR). Prior methods for FDR control in the online setting have focused on formulating algorithms when specific dependency structures are assumed to exist between the test statistics of each hypothesis. However, in practice, these dependencies often cannot be known beforehand or tested after the fact. Our algorithm, e-LOND, provides FDR control under arbitrary, possibly unknown, dependence. We show that our method is more powerful than existing approaches to this problem through simulations. We also formulate extensions of this algorithm to utilize randomization for increased power, and for constructing confidence intervals in online selective inference.  ( 2 min )
    Generalized Simultaneous Perturbation-based Gradient Search with Reduced Estimator Bias. (arXiv:2212.10477v2 [cs.LG] UPDATED)
    We present in this paper a family of generalized simultaneous perturbation-based gradient search (GSPGS) estimators that use noisy function measurements. The number of function measurements required by each estimator is guided by the desired level of accuracy. We first present in detail unbalanced generalized simultaneous perturbation stochastic approximation (GSPSA) estimators and later present the balanced versions (B-GSPSA) of these. We extend this idea further and present the generalized smoothed functional (GSF) and generalized random directions stochastic approximation (GRDSA) estimators, respectively, as well as their balanced variants. We show that estimators within any specified class requiring more number of function measurements result in lower estimator bias. We present a detailed analysis of both the asymptotic and non-asymptotic convergence of the resulting stochastic approximation schemes. We further present a series of experimental results with the various GSPGS estimators on the Rastrigin and quadratic function objectives. Our experiments are seen to validate our theoretical findings.  ( 2 min )
    Compact Matrix Quantum Group Equivariant Neural Networks. (arXiv:2311.06358v1 [cs.LG])
    We derive the existence of a new type of neural network, called a compact matrix quantum group equivariant neural network, that learns from data that has an underlying quantum symmetry. We apply the Woronowicz formulation of Tannaka-Krein duality to characterise the weight matrices that appear in these neural networks for any easy compact matrix quantum group. We show that compact matrix quantum group equivariant neural networks contain, as a subclass, all compact matrix group equivariant neural networks. Moreover, we obtain characterisations of the weight matrices for many compact matrix group equivariant neural networks that have not previously appeared in the machine learning literature.  ( 2 min )

  • Open

    [D] Exact kNN for high dimensional data
    Hi guys, I am new to ML and was experimenting with kNN search algorithms. I have very high dimensional data 1000+ dimensions. What data structure is best suited for such high dimensional data. I can bring down the dimensions to apprix 150 using using PCA without being too lossy. Even then I am having hard time finding techniques that work with such high dimensional data. I am not looking for Approximate NN search using LSH. What is the best technique that can be used here, kd tree doesn't work well with high dimensional data, would Rtree or ball tree be a better choice or something different? submitted by /u/xero786 [link] [comments]  ( 9 min )
    [R] Beyond U: Making Diffusion Models Faster & Lighter
    Paper: https://arxiv.org/abs/2310.20092 Abstract: Diffusion models are a family of generative models that yield record-breaking performance in tasks such as image synthesis, video generation, and molecule design. Despite their capabilities, their efficiency, especially in the reverse denoising process, remains a challenge due to slow convergence rates and high computational costs. In this work, we introduce an approach that leverages continuous dynamical systems to design a novel denoising network for diffusion models that is more parameter-efficient, exhibits faster convergence, and demonstrates increased noise robustness. Experimenting with denoising probabilistic diffusion models, our framework operates with approximately a quarter of the parameters and 30% of the Floating Point Operations (FLOPs) compared to standard U-Nets in Denoising Diffusion Probabilistic Models (DDPMs). Furthermore, our model is up to 70% faster in inference than the baseline models when measured in equal conditions while converging to better quality solutions. https://preview.redd.it/djk9mdlc9e0c1.png?width=995&format=png&auto=webp&s=65a002f1f320e68b71753ac32c6386c22e76c1c9 https://preview.redd.it/i87gizkc9e0c1.png?width=1108&format=png&auto=webp&s=34f25ecc319ffa34f545e850a5c95cb007e0abd8 ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    Are LLMs at a practical limit for layer stacking? [D]
    Can LLMs stack more layers than the largest ones currently have, or is it bottlenecked? Is it because the gradients can’t propagate properly to the beginning of the network? Because inference would be to slow? If anyone could provide a paper that talks about layer stacking scaling I would love to read it! submitted by /u/cstein123 [link] [comments]  ( 9 min )
    [D] The random feature zoo
    I am trying to wrap my head around methods that use random selection of features and then fit to data solving a linear optimization problem. I've seen a bunch of names of methods in the literature that look essentially the same. There are methods like random vector functional link which use a neural network type architecture. Similar methods were done for radial basis functions. Extreme learning machines are basically the same as these and took their ideas. Then there are random feature methods. This branch of literature seems to begin with Rahimi and Recht. They seem to come from kernel methods and random Fourier series. Should I think of random features as an umbrella term over all the kernel setups including single layer neural network and radial basis functions? submitted by /u/doncosaco [link] [comments]  ( 9 min )
    [D] Online MS in Machine Learning programs?
    Hey everyone! My work has given us the clear to have some amount of tuition reimbursed for continuing education. I currently have an MS in Statistics, but it's largely focused on traditional statistics with a bit of ML. I think it makes sense to pursue an MS in ML to further my understanding, but since I'm employed full-time, in-person classes are not an option. Does the community have an opinions on which programs may be more or less worth looking into for a respectable credential and a strong learning experience? ​ submitted by /u/Karsticles [link] [comments]  ( 9 min )
    Need Help... Apple Create ML Giving weird results [D] [P]
    Hi guys, just a little background of what I'm trying to create: I am aiming to create a model that converts sign language to text, using the WLASL dataset. Now, from the get-go, downloading this model from kaggle, while the dataset seems quite comprehensive, the amount of videos per class range from 5-13, which is obviously quite less to train on. I decided to try out Apple Create ML instead of something like tensorflow or even more complex deep learning frameworks as this would be much more simple. Since the dataset is quite limited in terms of videos per class, I used all 6 data augmentations in the "Hand Action Classifier" (Horizontally Flip, Rotate, Translate, Scale, Interpolate Frames, Drop Frames). While I knew this could not save the model, it would definitely increase the accuracy…  ( 10 min )
    [P] Research project ideas
    I‘m looking for a research project that I can work on every day for about 6 months. At the moment I am interested in active learning and/or ML in medicine. But I’m also open to other areas. I don‘t have much experience in ML, but have a lot of time to invest in. submitted by /u/Miserable_Air_5272 [link] [comments]  ( 9 min )
    [D] Where to Benchmark Neural Net to Run on the Edge?
    I have a neural net using the MobilenetV3 and tensorflow lite and want to run it on the edge (locally on the chip with no server connection), but I need to determine which chips are capable or running it or how much I need to reduce the scope of my neural net. A few months ago I ran my neural net program through a website that benchmarked it across different chip manufafcturers and families of chips to find out what chips I could use. However, I can't find that website. What are my options for benchmarking my neural net performance across many different chip manufactuers to see what chips are capable of running it? submitted by /u/freebird4446 [link] [comments]  ( 9 min )
    [D] Paperspace credits
    Digital Ocean offers 200 dollars of credit on their products. They specifically didnt mention paperspace. Can I use thos credits to rent high end gpus on paperspace? submitted by /u/DirectionOdd9824 [link] [comments]  ( 9 min )
    [D] The Trade-offs in Model Complexity: Performance vs. Interpretability
    I've been grappling with a challenge that I believe many of you might have encountered in your work: the delicate balance between model complexity and interpretability. As our models become increasingly sophisticated, achieving remarkable performance levels, there's a noticeable trade-off with the ease of understanding and explaining these models' decisions. Performance Metrics: How do you evaluate the performance of your models? Do you prioritize accuracy, precision/recall, or some other metric? How do you decide what's most important for your specific application? Complex Models vs. Simplicity: In your experience, what are the pros and cons of using more complex models (like deep learning) versus simpler, more interpretable models (like decision trees or linear regression)? Are there specific scenarios where one significantly outperforms the other? Interpretability Tools: What tools or techniques do you use to interpret complex models? How effective have they been in providing insights into your model's decision-making process? Ethical Considerations: How do you address the ethical implications of using complex models, especially in sensitive areas like healthcare, finance, or criminal justice, where explainability is crucial? Future Directions: Where do you see the future of machine learning heading in terms of balancing performance with interpretability? Do you foresee any breakthroughs that might help resolve this trade-off? I'm looking forward to a robust discussion on this topic. Let's share our experiences, challenges, and insights to better understand how we can develop powerful yet understandable machine learning models. submitted by /u/Kinz0id [link] [comments]  ( 9 min )
    [D] Kaggle House Price Competition Low Scores
    I see some crazy low scores for the Kaggle House Prices Competition, the latest notebooks I see are all getting scores in the 0.1 - 0.15 range. I am not sure how leaderboard is getting such low scores. ​ https://preview.redd.it/vl5sm8c4sb0c1.png?width=1330&format=png&auto=webp&s=73fc9789830f09723fe57ac14c8cf16785a0622b The most popular notebooks are also getting scores in that range as well, is there some new algorithm that people are using? ​ Which algorithm are they using? submitted by /u/ng_guardian [link] [comments]  ( 9 min )
    [D] Tips for parameter efficient fine-tuning of language models
    Hi, I'm using the PEFT library for fine-tuning a BART model in generation tasks. I have two questions: I experimented with Prompt Tuning, IA3 and LoRA. The first two yielded crappy results (ROUGE scores ~1-2 points after 30K steps ..) while LoRA had acceptable scores (close to fine-tuning). In terms of trainable params, Prompt Tuning and IA3 has around 200K whereas LoRA had ~2M (does not matter much to me though). Is LoRA still the best method for efficient fine-tuning, or am I just setting the hyperparams wrong somewhere ? Normally for full fine-tuning I would use a learning rate of 1e-5 with linear decay. And I'm using the same parameters for efficient fine-tuning (LoRA, IA3 .. ). Does this make sense or should I use a larger values ? (e.g. I see the IA3 paper author setting lr around 3e-3, but they work on classification tasks so I'm not sure) Many thanks in advance ! submitted by /u/KarmaCut132 [link] [comments]  ( 9 min )
    [D] Any pre-trained models for translation to English?
    I am looking for a pre-trained transformer model that would be able to translate texts in various languages to English. I have found some capable of this. e.g. SeamlessM4T or Madlad-400, however, these are both for multilingual translation and are a bit slow for my use case. Can anyone recommend a fast model, perhaps even model that's pre-trained to only translate to English (which is all I need)? Use case: I need to translate millions of texts to English and I want to avoid using APIs or too large models. submitted by /u/6NBUonmLD74a [link] [comments]  ( 9 min )
    [D] Is there any research for RAG where vectors that are queried in succession become associated for future queries, even if their embedding values are different?
    A common situation in IRL problems with long time horizons is the need to perform multiple very different subtasks. For example, imagine a model trained to remember a poem and then spell it out in blocks in a game of minecraft. The data for the poem itself and the appropriate minecraft functions probably have very different embeddings, but in practice it would probably be useful to ensure the memories for how to use minecraft functions are queried when that poem is queried. It seems like just querying a RAG DB for the vectors with the highest cosine similarity won't be super useful for this task. A query for poems will just find poem-like data. But we don't just want to find things with similar embeddings to poems, we want to find data that is useful. Has there been any research into this time-series / associative type of RAG? ​ submitted by /u/30299578815310 [link] [comments]  ( 9 min )
    ChatGPT prompts for other LLM chatbots [D]
    Has the community got best practices for writing prompts? I'm looking at chatGPT but it may be that there are principles that work with the other LLM chatbots and some of these prompt examples out there are too chatGPT specific submitted by /u/HotRepresentative325 [link] [comments]  ( 1 min )
    [R] Brainstorming Multiple Actions in Continuous Control
    Brainstorming Multiple Actions in Continuous Control Environments with continuous action spaces are a big challenge in RL. Because there are an infinite number of possible actions, evaluating every possible action isn't feasible. Instead, actors are trained to output a single action that maximizes value according to the critic. But it seems hard for the actor to consistently find the global optimum of the critic; so, why not output multiple actions, and use the value network to choose the best one? I built off of TD3, modifying the actor to output multiple actions, and trained it with an auxiliary contrastive loss to ensure diversity. Overall, I found this agent, TD3-M, improved results on both Gymnasium MuJoCo environments and the DeepMind Control Suite over vanilla TD3. Compared to related work such as actor ensembles, this method (multiheaded actor + contrastive loss) is simple, computationally cheap, and high performance, with TD3-M almost matching Dreamerv3's performance on the DeepMind Control Suite. submitted by /u/307thML [link] [comments]  ( 9 min )
    [P] Labelformat now supports all major vision labeling formats and runs on Windows, Linux and macOS
    We’re happy to share an update on Labelformat - A tool for converting computer vision label formats. This project started internally at Lightly as we had to deal with lots of different labeling formats for our projects. We didn’t find anything working out of the box without having to sign up for a service. We just wanted some easy-to-use tool that could read and write from different formats. The open-source package is under the MIT license. Our hope is that others might contribute additional formats to the repo to solve this issue. Please take a look. If you like what we do and think a format is missing, create an issue. If you feel motivated enough and have the time, we would be super grateful for a PR. And if you don’t have the time to create an issue or PR but still want to support us, you can also leave a star :) Features Support for common dataset label formats such as YOLO, COCO, Kitti (more coming soon) Support for common tool formats such as Labelbox (more coming soon) Minimal dependencies, targets Python 3.7 or higher Memory conscious - datasets are processed file-by-file instead of loading everything in memory (when possible) Typed Tested with round trip tests to ensure consistency MIT license Runs on Windows, Linux, macOS Github link: https://github.com/lightly-ai/labelformat submitted by /u/igorsusmelj [link] [comments]  ( 9 min )
    [P] I got a ML internship and have no idea what to do
    I’ve got research background in ML but never actually developed any models as it was all theoretical work. I got lucky during the interview stage for this role as my research impressed them. My project involves fine-tuning a GPT-3 model for a specific task and host the model on a website. Does anyone have any tips on how to go about learning what I need to know to do this? Also what should I consider when curating my custom dataset when fine-tuning the model? I really want this to be a learning experience for me. submitted by /u/danielwilu2525 [link] [comments]  ( 9 min )
    [P] Higgsfield.AI – Anyone can train Llama 70B or Mistral for free
    https://higgsfield.ai We have a massive GPU cluster and developed our own infrastructure to manage the cluster and train massive models. There's how it works: You upload the dataset with preconfigured format into HuggingFaсe [1]. Choose your LLM (e.g. LLaMa 70B, Mistral 7B) Place your submission into the queue Wait for it to get trained. Then you get your trained model there on HuggingFace. Essentially, why would we want to do it? We already have an experience with training big LLMs. We could achieve near-perfect infrastructure performance for training. Sometimes GPUs have just nothing to train. Thus we thought it would be cool if we could utilize our GPU cluster 100%. And give back to Open Source community (already built an e2e distributed training framework [2]). This is in an early stage, so you can expect some bugs. Any thoughts, opinions, or ideas are quite welcome! [1]: https://github.com/higgsfield-ai/higgsfield/blob/main/tutori... [2]: https://github.com/higgsfield-ai/higgsfield submitted by /u/higgsfield_ai [link] [comments]  ( 9 min )
    [D] How does `@jax.jit` compare with PyTorch's `@torch.compile`?
    Have any of you seen benchmarks comparing the performance of `@jax.jit` and `@torch.compile`, especially when using functional PyTorch code? Are the performance differences big? Small? Do they depend a lot on what you're doing? submitted by /u/Llamas1115 [link] [comments]  ( 9 min )
  • Open

    Flag harmful content using Amazon Comprehend toxicity detection
    Online communities are driving user engagement across industries like gaming, social media, ecommerce, dating, and e-learning. Members of these online communities trust platform owners to provide a safe and inclusive environment where they can freely consume content and contribute. Content moderators are often employed to review user-generated content and check that it’s safe and compliant […]  ( 7 min )
    Fine-tune and Deploy Mistral 7B with Amazon SageMaker JumpStart
    Today, we are excited to announce the capability to fine-tune the Mistral 7B model using Amazon SageMaker JumpStart. You can now fine-tune and deploy Mistral text generation models on SageMaker JumpStart using the Amazon SageMaker Studio UI with a few clicks or using the SageMaker Python SDK. Foundation models perform very well with generative tasks, […]  ( 10 min )
    Harness large language models in fake news detection
    Fake news, defined as news that conveys or incorporates false, fabricated, or deliberately misleading information, has been around as early as the emergence of the printing press. The rapid spread of fake news and disinformation online is not only deceiving to the public, but can also have a profound impact on society, politics, economy, and […]  ( 17 min )
    Model management for LoRA fine-tuned models using Llama2 and Amazon SageMaker
    In the era of big data and AI, companies are continually seeking ways to use these technologies to gain a competitive edge. One of the hottest areas in AI right now is generative AI, and for good reason. Generative AI offers powerful solutions that push the boundaries of what’s possible in terms of creativity and […]  ( 13 min )
  • Open

    Stable Baselines AssertionError: Your environment must inherit from the gymnasium.Env class cf. https://gymnasium.farama.org/api/env/
    Im trying to follow this tutorial https://www.youtube.com/watch?v=uKnjGn8fF70&t=1017s ​ AssertionError: Your environment must inherit from the gymnasium.Env class cf. https://gymnasium.farama.org/api/env/ ​ The error is on !pip install stable-baselines3[extra]from stable_baselines3.common.env_checker import check_envenv = PongEnv()check_env(env) ​ I tried to downgrade to gym v 0.21 but it did not build, outputRequirement already satisfied: setuptools==65.5.0 in /usr/local/lib/python3.10/dist-packages (65.5.0) Collecting gym==0.21 Using cached gym-0.21.0.tar.gz (1.5 MB) Preparing metadata (setup.py) ... done Requirement already satisfied: numpy>=1.18.0 in /usr/local/lib/python3.10/dist-packages (from gym==0.21) (1.23.5) Requirement already satisfied: cloudpickle>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from gym==0.21) (2.2.1) Building wheels for collected packages: gym error: subprocess-exited-with-error × python setup.py bdist_wheel did not run successfully. │ exit code: 1 ╰─> See above for output. note: This error originates from a subprocess, and is likely not a problem with pip. Building wheel for gym (setup.py) ... error ERROR: Failed building wheel for gym Running setup.py clean for gym Failed to build gym ERROR: Could not build wheels for gym, which is required to install pyproject.toml-based projects I'd like to use the latest version, how can I correct check_env to work with 0.26 submitted by /u/Sharp-Cat2319 [link] [comments]  ( 1 min )
    State representation for a large dynamic state space
    I have a multi-object placement problem, as the title says the number of objects to be placed are dynamic, and the action to move can be taken on any of the objects. Each state has a rational score based on the reward function and the objects need to be placed in an optimal way whose rational score is high enough. The problem I am facing is about the representation of the state space, if I choose a simple table lookup the computations are extremely large as there are over a billion states for a simple model. My supervisor said about using functional approximations for states. Still I am very confused on how to do it, any help would be appreciated. Thanks. submitted by /u/RuthlessDaoist [link] [comments]  ( 1 min )
    Tensorflow DQN
    I am trying to train a TF DQN agent on a problem where the next state is not the observation of the current state, but rather a new state sampled from a dataset of states. I am currently returning an empty state after each step, would that be the ideal thing to do or is there another way of doing this? Is it even logical to train on states like this? submitted by /u/muzzez321 [link] [comments]  ( 1 min )
    "Hidden Incentives for Auto-Induced Distributional Shift", Krueger et al 202
    submitted by /u/gwern [link] [comments]  ( 1 min )
    PPO for maintenance planning
    Hi. I am currently trying to optimize a maintenance system using PPO. I have set my reward structure to: Action 0: reward = C1 * exp(-(C2 * t)) Action 1: reward = -C3 C1,2,3 are constants. t is number of time steps since action 1 was performed. When C3 is 0 the program performs as expected, but as I increase the penalty for action 1 the program performs it more frequently. As C3 increases, the program seemingly converges to performing random actions at every time step and racks up massive penalties. Does anyone know why I have this result or a solution to the problem? submitted by /u/SkivetOst [link] [comments]  ( 1 min )
    Offline Reinforcement Learning Resources
    I’m looking for some resources that can help me understand the practical aspects of implementing offline RL algorithms, such as data preprocessing, model selection, evaluation metrics, and debugging. Ideally, I would like to find some walkthroughs or tutorials that provide thorough code examples and explanations. Does anyone have any recommendations? Thanks in advance! submitted by /u/anointedninja [link] [comments]  ( 1 min )
    Finally a clear article on Terminated vs Truncated states
    I was treating them the same way for done in my replay buffer. I suppose a follow up question, is "timing out" not a failed state, if the agent needs to complete a task within a set number of steps? In this case truncated is not used and terminated can be set to True and punishing the agent. https://farama.org/Gymnasium-Terminated-Truncated-Step-API Edit: Last paragraph answers my question: Note that while finite horizon tasks end due to a time limit, this would be considered a termination since the time limit is built into the task. For these tasks, to preserve the Markov property, it is essential to add information about ‘time remaining’ in the state. For this reason, Gym includes a TimeObservation wrapper for users who wish to include the current time step in the agent’s observation. submitted by /u/_Linux_AI_ [link] [comments]  ( 1 min )
  • Open

    Google wants governments to form a 'global AI corps'
    Google is calling on governments to strengthen their AI training and skilling initiatives and develop a 'global AI corps'. The tech giant's white paper urges governments to expand certificate and skilling programs, build local research centers, and focus on channeling AI's potential benefits. Google also supports the idea of creating an 'AI education bill' to retrain and skill at least 1 million people, similar to the GI Bill for veterans. The paper is expected to guide Google's approach to regulatory talks in Washington. Google's white paper arrives as Senate lawmakers discuss boosting AI development while setting guardrails around its use. President Biden's AI executive order, which aims to guide the federal government's deployment of AI tools, is seen as having 'very promising elements' by Google. However, there are still questions about how federal agencies will write and implement guidelines around AI use. Source : https://www.washingtonpost.com/politics/2023/11/14/google-wants-governments-form-global-ai-corps/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    Clean up a scanned PDF
    I scanned an old sign in sheet with about 40 signatures and I need to clean it up a bit. At a glance, you can easily tell it's a copy just by the way the signatures look from being scanned & from being really old and I want to make it look as clean as possible. This is for archival / scrapbooking purposes. submitted by /u/rebelbaserec [link] [comments]  ( 1 min )
    AI website builder
    Hey Guys, What’s the best AI website builder platform to use to build a website from start to finish? Thanks! submitted by /u/SirCanIHelpYou [link] [comments]  ( 1 min )
    Deepmind create state-of-the-art model for fast and accurate 10-day weather predictions
    submitted by /u/fesenjoon [link] [comments]  ( 1 min )
    Is there a way for a Speech-to-Text model to differentiate between speakers?
    For instance, if I'm recording an interview between two people, and I have something like Whisper recording the discussion, can it break out the dialogue between the speakers? Seems like this would be a fairly simple feature, but I'm not sure if it exists. ​ Doesn't have to be Whisper per se, but is there a known S2T model or solution for this? submitted by /u/jrstelle [link] [comments]  ( 1 min )
    Approachable, but still technical, AI podcasts on generative AI techniques (e.g. like MLG)
    ​ Hey, I've been learning about AI techniques in my spare time, for both interest and work (I'm a university Prof working in HCI/interaction design and understanding the possibilities and limitations of AI -- and why, technically -- becoming an essential area of knowledge for me). I've found that the best way to learn complex techniques has been to read about them in books (perhaps not fully understanding the first time), listen to a podcast that explains them again (understanding more) and then read the same book chapter again (finally 'getting it'). Machine Learning Guide has been the perfect podcast for this. However, it doesn't cover the generative techniques I'm learning about now (e.g. VAE, GAN etc. etc. etc). I've tried looking for other podcasts, but everything I find seems to be at much more of a pop-science level: two folks having a coffee chat about general concepts and not how they work. Does anyone have any recommendations for AI podcasts similar to Machine Learning Guide but for generative techniques (or others like MLG in general, as I can't get enough!). Thanks! submitted by /u/jnthhk [link] [comments]  ( 1 min )
    When you are old
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    Exploring AI Conversations in Nature: My Experience with ChatGPT Voice
    Recently, I embarked on an intriguing journey, blending the realms of artificial intelligence and the serenity of nature. I took a walk in the park, not alone, but accompanied by the voice of ChatGPT. For an hour, we conversed, not through the taps of a keyboard but through the simplicity of spoken words. This experience was not just a technological experiment, but a glimpse into the future of human-AI interaction. The fluidity and ease of conversation with ChatGPT's voice interface were remarkable. It felt like discussing with a knowledgeable friend, seamlessly blending into my natural environment. What struck me most was the versatility of AI in adapting to our daily routines. It's not just about screens and keyboards; it's about integrating intelligently into our lives, enhancing our experiences, and learning on the go. As we venture further into this AI-augmented era, I'm excited to see how these technologies will continue to evolve and intertwine with our daily lives. The potential for learning, sharing, and growing with AI as our companion is immense. submitted by /u/klevismiho [link] [comments]  ( 1 min )
    What AI tool is good as "dungeon master" for rol games nowadays?
    A few years ago, AI Dungeon used to be a very cool and interesting tool for this purpose, but very limited in many aspects. When ChatGPT was released, AI dungeon felt obsoleted. But then again, ChatGPT wasnt that good at remembering events from the past. Nowadays, if Im bored and want to have a quick campaing, what AI tool you guys recommend and what tips can you give me? Thank you! submitted by /u/Kastila1 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/13/2023
    Google sues scammers spreading malware disguised as its Bard AI chatbot.[1] Nvidia on Monday introduced the latest version H200 of its main AI processor, aimed at meeting the compute needs of organizations with large AI workloads.[2] OpenAI recruiters are trying to lure Google AI employees with $10 million pay packets.[3] OpenAI chief seeks new Microsoft funds to build ‘superintelligence’.[4] Sources: [1] https://mashable.com/article/bard-ai-malware-scam-google-lawsuit [2] https://www.techtarget.com/searchenterpriseai/news/366559236/Nvidia-intros-new-H200-for-running-large-AI-workloads [3] https://www.businessinsider.com/openai-recruiters-luring-google-ai-employees-10-million-compensation-package-2023-11 [4] https://www.ft.com/content/dd9ba2f6-f509-42f0-8e97-4271c7b84ded submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    look at this fox ai art i did with stable diffusion
    submitted by /u/marlon99rocks99 [link] [comments]  ( 1 min )
  • Open

    Scaling multimodal understanding to long videos
    Posted by Isaac Noble, Software Engineer, Google Research, and Anelia Angelova, Research Scientist, Google DeepMind When building machine learning models for real-life applications, we need to consider inputs from multiple modalities in order to capture various aspects of the world around us. For example, audio, video, and text all provide varied and complementary information about a visual input. However, building multimodal models is challenging due to the heterogeneity of the modalities. Some of the modalities might be well synchronized in time (e.g., audio, video) but not aligned with text. Furthermore, the large volume of data in video and audio signals is much larger than that in text, so when combining them in multimodal models, video and audio often cannot be fully consumed a…  ( 93 min )
  • Open

    DSC Weekly 14 November 2023
    Announcements Top Stories In-Depth The post DSC Weekly 14 November 2023 appeared first on Data Science Central.  ( 20 min )
    Beyond the numbers: The soft skills that elevate data analysts to the next level
    Data analysts must have a strong grasp of practical data visualization skills to paint a clear picture of complex data for a broader audience. Seeing the big picture by delivering coherent and easily comprehensible content is crucial. Companies highly value avant-garde data analysts who can not only dig into the data but also connect the… Read More »Beyond the numbers: The soft skills that elevate data analysts to the next level  The post Beyond the numbers: The soft skills that elevate data analysts to the next level  appeared first on Data Science Central.  ( 21 min )
    Difference between modern and traditional data quality
    Modern data quality practices leverage advanced technologies, automation, and machine learning to handle diverse data sources, ensure real-time processing, and foster collaboration across stakeholders. They prioritize data governance, continuous monitoring, and proactive management to ensure accurate, reliable, and fit-for-purpose data for informed decision-making and business success. Modern data quality practices differ from traditional data quality… Read More »Difference between modern and traditional data quality The post Difference between modern and traditional data quality appeared first on Data Science Central.  ( 20 min )
    Machine learning in marketing: 10 use cases and implementation tips
    Given how quickly the digital marketing industry changes, keeping up with the most recent trends and technologies can be challenging. Traditional marketing techniques are no longer enough to connect with and engage your target audience. Machine learning in marketing has got your back.  Check out the top 10 use cases and implementation pointers that offer… Read More »Machine learning in marketing: 10 use cases and implementation tips The post Machine learning in marketing: 10 use cases and implementation tips appeared first on Data Science Central.  ( 24 min )
    New Book: Statistical Optimization for GenAI and Machine Learning
    In the last two years, I published 5 machine learning and AI books, including one on synthetic data by Elsevier. This represents over 800 pages of compact, state-of-the-art material. The new addition features my most recent advances: the problems that I encountered with generative adversarial networks, and how I overcome them with new techniques. The… Read More »New Book: Statistical Optimization for GenAI and Machine Learning The post New Book: Statistical Optimization for GenAI and Machine Learning appeared first on Data Science Central.  ( 21 min )
  • Open

    Challenge Accepted: Animator Sir Wade Neistadt Leads Robotic Revolution in Record Time This Week ‘In the NVIDIA Studio’
    Character animator Sir Wade Neistadt works to make animation and 3D education more accessible for aspiring and professional artists alike through video tutorials and industry training.  ( 8 min )
  • Open

    Ghostbuster: Detecting Text Ghostwritten by Large Language Models
    The structure of Ghostbuster, our new state-of-the-art method for detecting AI-generated text. Large language models like ChatGPT write impressively well—so well, in fact, that they’ve become a problem. Students have begun using these models to ghostwrite assignments, leading some schools to ban ChatGPT. In addition, these models are also prone to producing text with factual errors, so wary readers may want to know if generative AI tools have been used to ghostwrite news articles or other sources before trusting them. What can teachers and consumers do? Existing tools to detect AI-generated text sometimes do poorly on data that differs from what they were trained on. In addition, if these models falsely classify real human writing as AI-generated, they can jeopardize students whose genui…  ( 4 min )
    Asymmetric Certified Robustness via Feature-Convex Neural Networks
    Asymmetric Certified Robustness via Feature-Convex Neural Networks TLDR: We propose the asymmetric certified robustness problem, which requires certified robustness for only one class and reflects real-world adversarial scenarios. This focused setting allows us to introduce feature-convex classifiers, which produce closed-form and deterministic certified radii on the order of milliseconds. Despite their widespread usage, deep learning classifiers are acutely vulnerable to adversarial examples: small, human-imperceptible image perturbations that fool machine learning models into misclassifying the modified input. This weakness severely undermines the reliability of safety-critical processes that incorporate machine learning. Many empirical defenses against adversarial perturbations have be…  ( 4 min )
  • Open

    Rational solution to Korteweg–De Vries equation
    Students seeing differential equations for the first time expect every equation to have a nice closed-form solution, because up to that point in their education nearly every problem they’ve seen has been contrived to have a nice closed-form solution. Once you resign yourself to the fact that a differential equation will rarely have a closed […] Rational solution to Korteweg–De Vries equation first appeared on John D. Cook.  ( 5 min )
  • Open

    Attention-based Multi-task Learning for Base Editor Outcome Prediction. (arXiv:2310.02919v2 [cs.LG] UPDATED)
    Human genetic diseases often arise from point mutations, emphasizing the critical need for precise genome editing techniques. Among these, base editing stands out as it allows targeted alterations at the single nucleotide level. However, its clinical application is hindered by low editing efficiency and unintended mutations, necessitating extensive trial-and-error experimentation in the laboratory. To speed up this process, we present an attention-based two-stage machine learning model that learns to predict the likelihood of all possible editing outcomes for a given genomic target sequence. We further propose a multi-task learning schema to jointly learn multiple base editors (i.e. variants) at once. Our model's predictions consistently demonstrated a strong correlation with the actual experimental results on multiple datasets and base editor variants. These results provide further validation for the models' capacity to enhance and accelerate the process of refining base editing designs.  ( 2 min )
    ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training. (arXiv:2210.01738v3 [cs.LG] UPDATED)
    CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required to train image and text encoders from scratch on a huge dataset. LiT improved this by only training the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller amount of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable as every dimension corresponds to the similarity of the input to a unique image-text pair in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.  ( 2 min )
    Taming Local Effects in Graph-based Spatiotemporal Forecasting. (arXiv:2302.04071v2 [cs.LG] UPDATED)
    Spatiotemporal graph neural networks have shown to be effective in time series forecasting applications, achieving better performance than standard univariate predictors in several settings. These architectures take advantage of a graph structure and relational inductive biases to learn a single (global) inductive model to predict any number of the input time series, each associated with a graph node. Despite the gain achieved in computational and data efficiency w.r.t. fitting a set of local models, relying on a single global model can be a limitation whenever some of the time series are generated by a different spatiotemporal stochastic process. The main objective of this paper is to understand the interplay between globality and locality in graph-based spatiotemporal forecasting, while contextually proposing a methodological framework to rationalize the practice of including trainable node embeddings in such architectures. We ascribe to trainable node embeddings the role of amortizing the learning of specialized components. Moreover, embeddings allow for 1) effectively combining the advantages of shared message-passing layers with node-specific parameters and 2) efficiently transferring the learned model to new node sets. Supported by strong empirical evidence, we provide insights and guidelines for specializing graph-based models to the dynamics of each time series and show how this aspect plays a crucial role in obtaining accurate predictions.  ( 2 min )
    Dataset Distillation with Convexified Implicit Gradients. (arXiv:2302.06755v2 [cs.LG] UPDATED)
    We propose a new dataset distillation algorithm using reparameterization and convexification of implicit gradients (RCIG), that substantially improves the state-of-the-art. To this end, we first formulate dataset distillation as a bi-level optimization problem. Then, we show how implicit gradients can be effectively used to compute meta-gradient updates. We further equip the algorithm with a convexified approximation that corresponds to learning on top of a frozen finite-width neural tangent kernel. Finally, we improve bias in implicit gradients by parameterizing the neural network to enable analytical computation of final-layer parameters given the body parameters. RCIG establishes the new state-of-the-art on a diverse series of dataset distillation tasks. Notably, with one image per class, on resized ImageNet, RCIG sees on average a 108\% improvement over the previous state-of-the-art distillation algorithm. Similarly, we observed a 66\% gain over SOTA on Tiny-ImageNet and 37\% on CIFAR-100.  ( 2 min )
    Optimized Sparse Matrix Operations for Reverse Mode Automatic Differentiation. (arXiv:2212.05159v3 [cs.LG] UPDATED)
    Sparse matrix representations are ubiquitous in computational science and machine learning, leading to significant reductions in compute time, in comparison to dense representation, for problems that have local connectivity. The adoption of sparse representation in leading ML frameworks such as PyTorch is incomplete, however, with support for both automatic differentiation and GPU acceleration missing. In this work, we present an implementation of a CSR-based sparse matrix wrapper for PyTorch with CUDA acceleration for basic matrix operations, as well as automatic differentiability. We also present several applications of the resulting sparse kernels to optimization problems, demonstrating ease of implementation and performance measurements versus their dense counterparts.  ( 2 min )
    Sample Efficient Reward Augmentation in offline-to-online Reinforcement Learning. (arXiv:2310.19805v2 [cs.LG] UPDATED)
    A prospective application of offline reinforcement learning (RL) involves initializing a pre-trained policy using existing static datasets for subsequent online fine-tuning. However, direct fine-tuning of the offline pre-trained policy often results in sub-optimal performance. A primary reason is that offline conservative methods diminish the agent's capability of exploration, thereby impacting online fine-tuning performance. To enhance exploration during online fine-tuning and thus enhance the overall online fine-tuning performance, we introduce a generalized reward augmentation framework called Sample Efficient Reward Augmentation (SERA). SERA aims to improve the performance of online fine-tuning by designing intrinsic rewards that encourage the agent to explore. Specifically, it implicitly implements State Marginal Matching (SMM) and penalizes out-of-distribution (OOD) state actions, thus encouraging agents to cover the target state density, and achieving better online fine-tuning results. Additionally, SERA can be effortlessly plugged into various RL algorithms to improve online fine-tuning and ensure sustained asymptotic improvement, showing the versatility as well as the effectiveness of SERA. Moreover, extensive experimental results will demonstrate that when conducting offline-to-online problems, SERA consistently and effectively enhances the performance of various offline algorithms.  ( 2 min )
    Deep Fast Vision: A Python Library for Accelerated Deep Transfer Learning Vision Prototyping. (arXiv:2311.06169v1 [cs.CV])
    Deep learning-based vision is characterized by intricate frameworks that often necessitate a profound understanding, presenting a barrier to newcomers and limiting broad adoption. With many researchers grappling with the constraints of smaller datasets, there's a pronounced reliance on pre-trained neural networks, especially for tasks such as image classification. This reliance is further intensified in niche imaging areas where obtaining vast datasets is challenging. Despite the widespread use of transfer learning as a remedy to the small dataset dilemma, a conspicuous absence of tailored auto-ML solutions persists. Addressing these challenges is "Deep Fast Vision", a python library that streamlines the deep learning process. This tool offers a user-friendly experience, enabling results through a simple nested dictionary definition, helping to democratize deep learning for non-experts. Designed for simplicity and scalability, Deep Fast Vision appears as a bridge, connecting the complexities of existing deep learning frameworks with the needs of a diverse user base.  ( 2 min )
    Neural Control of Parametric Solutions for High-dimensional Evolution PDEs. (arXiv:2302.00045v2 [math.NA] UPDATED)
    We develop a novel computational framework to approximate solution operators of evolution partial differential equations (PDEs). By employing a general nonlinear reduced-order model, such as a deep neural network, to approximate the solution of a given PDE, we realize that the evolution of the model parameter is a control problem in the parameter space. Based on this observation, we propose to approximate the solution operator of the PDE by learning the control vector field in the parameter space. From any initial value, this control field can steer the parameter to generate a trajectory such that the corresponding reduced-order model solves the PDE. This allows for substantially reduced computational cost to solve the evolution PDE with arbitrary initial conditions. We also develop comprehensive error analysis for the proposed method when solving a large class of semilinear parabolic PDEs. Numerical experiments on different high-dimensional evolution PDEs with various initial conditions demonstrate the promising results of the proposed method.  ( 2 min )
    Orthogonal Polynomials Approximation Algorithm (OPAA):a functional analytic approach to estimating probability densities. (arXiv:2211.08594v2 [cs.LG] UPDATED)
    We present the new Orthogonal Polynomials Approximation Algorithm (OPAA), a parallelizable algorithm that solves two problems from a functional analytic approach: first, it finds a smooth functional estimate of a density function, whether it is normalized or not; second, the algorithm provides an estimate of the normalizing weight. In the context of Bayesian inference, OPAA provides an estimate of the posterior function as well as the normalizing weight, which is also known as the evidence. A core component of OPAA is a special transform of the square root of the joint distribution into a special functional space of our construct. Through this transform, the evidence is equated with the $L^2$ norm of the transformed function, squared. Hence, the evidence can be estimated by the sum of squares of the transform coefficients. The computations can be parallelized and completed in one pass. To compute the transform coefficients, OPAA proposes a new computational scheme leveraging Gauss--Hermite quadrature in higher dimensions. Not only does it avoid the potential high variance problem associated with random sampling methods, it also enables one to speed up the computation by parallelization, and significantly reduces the complexity by a vector decomposition.  ( 2 min )
    LExecutor: Learning-Guided Execution. (arXiv:2302.02343v4 [cs.SE] UPDATED)
    Executing code is essential for various program analysis tasks, e.g., to detect bugs that manifest through exceptions or to obtain execution traces for further dynamic analysis. However, executing an arbitrary piece of code is often difficult in practice, e.g., because of missing variable definitions, missing user inputs, and missing third-party dependencies. This paper presents LExecutor, a learning-guided approach for executing arbitrary code snippets in an underconstrained way. The key idea is to let a neural model predict missing values that otherwise would cause the program to get stuck, and to inject these values into the execution. For example, LExecutor injects likely values for otherwise undefined variables and likely return values of calls to otherwise missing functions. We evaluate the approach on Python code from popular open-source projects and on code snippets extracted from Stack Overflow. The neural model predicts realistic values with an accuracy between 79.5% and 98.2%, allowing LExecutor to closely mimic real executions. As a result, the approach successfully executes significantly more code than any available technique, such as simply executing the code as-is. For example, executing the open-source code snippets as-is covers only 4.1% of all lines, because the code crashes early on, whereas LExecutor achieves a coverage of 51.6%.  ( 2 min )
    NESTER: An Adaptive Neurosymbolic Method for Causal Effect Estimation. (arXiv:2211.04370v4 [cs.AI] UPDATED)
    Causal effect estimation from observational data is a central problem in causal inference. Methods based on potential outcomes framework solve this problem by exploiting inductive biases and heuristics from causal inference. Each of these methods addresses a specific aspect of causal effect estimation, such as controlling propensity score, enforcing randomization, etc., by designing neural network (NN) architectures and regularizers. In this paper, we propose an adaptive method called Neurosymbolic Causal Effect Estimator (NESTER), a generalized method for causal effect estimation. NESTER integrates the ideas used in existing methods based on multi-head NNs for causal effect estimation into one framework. We design a Domain Specific Language (DSL) tailored for causal effect estimation based on causal inductive biases used in literature. We conduct a theoretical analysis to investigate NESTER's efficacy in estimating causal effects. Our comprehensive empirical results show that NESTER performs better than state-of-the-art methods on benchmark datasets.  ( 2 min )
    Understanding Reconstruction Attacks with the Neural Tangent Kernel and Dataset Distillation. (arXiv:2302.01428v2 [cs.LG] UPDATED)
    Modern deep learning requires large volumes of data, which could contain sensitive or private information that cannot be leaked. Recent work has shown for homogeneous neural networks a large portion of this training data could be reconstructed with only access to the trained network parameters. While the attack was shown to work empirically, there exists little formal understanding of its effective regime which datapoints are susceptible to reconstruction. In this work, we first build a stronger version of the dataset reconstruction attack and show how it can provably recover the \emph{entire training set} in the infinite width regime. We then empirically study the characteristics of this attack on two-layer networks and reveal that its success heavily depends on deviations from the frozen infinite-width Neural Tangent Kernel limit. Next, we study the nature of easily-reconstructed images. We show that both theoretically and empirically, reconstructed images tend to "outliers" in the dataset, and that these reconstruction attacks can be used for \textit{dataset distillation}, that is, we can retrain on reconstructed images and obtain high predictive accuracy.  ( 2 min )
    Interpretable Graph Anomaly Detection using Gradient Attention Maps. (arXiv:2311.06153v1 [cs.LG])
    Detecting unusual patterns in graph data is a crucial task in data mining. However, existing methods often face challenges in consistently achieving satisfactory performance and lack interpretability, which hinders our understanding of anomaly detection decisions. In this paper, we propose a novel approach to graph anomaly detection that leverages the power of interpretability to enhance performance. Specifically, our method extracts an attention map derived from gradients of graph neural networks, which serves as a basis for scoring anomalies. In addition, we conduct theoretical analysis using synthetic data to validate our method and gain insights into its decision-making process. To demonstrate the effectiveness of our method, we extensively evaluate our approach against state-of-the-art graph anomaly detection techniques. The results consistently demonstrate the superior performance of our method compared to the baselines.  ( 2 min )
    Computationally-Efficient Neural Image Compression with Shallow Decoders. (arXiv:2304.06244v2 [eess.IV] UPDATED)
    Neural image compression methods have seen increasingly strong performance in recent years. However, they suffer orders of magnitude higher computational complexity compared to traditional codecs, which hinders their real-world deployment. This paper takes a step forward towards closing this gap in decoding complexity by using a shallow or even linear decoding transform resembling that of JPEG. To compensate for the resulting drop in compression performance, we exploit the often asymmetrical computation budget between encoding and decoding, by adopting more powerful encoder networks and iterative encoding. We theoretically formalize the intuition behind, and our experimental results establish a new frontier in the trade-off between rate-distortion and decoding complexity for neural image compression. Specifically, we achieve rate-distortion performance competitive with the established mean-scale hyperprior architecture of Minnen et al. (2018) at less than 50K decoding FLOPs/pixel, reducing the baseline's overall decoding complexity by 80%, or over 90% for the synthesis transform alone. Our code can be found at https://github.com/mandt-lab/shallow-ntc.  ( 2 min )
    On the (In)security of Peer-to-Peer Decentralized Machine Learning. (arXiv:2205.08443v3 [cs.CR] UPDATED)
    In this work, we carry out the first, in-depth, privacy analysis of Decentralized Learning -- a collaborative machine learning framework aimed at addressing the main limitations of federated learning. We introduce a suite of novel attacks for both passive and active decentralized adversaries. We demonstrate that, contrary to what is claimed by decentralized learning proposers, decentralized learning does not offer any security advantage over federated learning. Rather, it increases the attack surface enabling any user in the system to perform privacy attacks such as gradient inversion, and even gain full control over honest users' local model. We also show that, given the state of the art in protections, privacy-preserving configurations of decentralized learning require fully connected networks, losing any practical advantage over the federated setup and therefore completely defeating the objective of the decentralized approach.  ( 2 min )
    PriorCVAE: scalable MCMC parameter inference with Bayesian deep generative modelling. (arXiv:2304.04307v3 [stat.ML] UPDATED)
    Recent advances have shown that GP priors, or their finite realisations, can be encoded using deep generative models such as variational autoencoders (VAEs). These learned generators can serve as drop-in replacements for the original priors during MCMC inference. While this approach enables efficient inference, it loses information about the hyperparameters of the original models, and consequently makes inference over hyperparameters impossible and the learned priors indistinct. To overcome this limitation, we condition the VAE on stochastic process hyperparameters. This allows the joint encoding of hyperparameters with GP realizations and their subsequent estimation during inference. Further, we demonstrate that our proposed method, PriorCVAE, is agnostic to the nature of the models which it approximates, and can be used, for instance, to encode solutions of ODEs. It provides a practical tool for approximate inference and shows potential in real-life spatial and spatiotemporal applications.  ( 2 min )
    Preemptively Pruning Clever-Hans Strategies in Deep Neural Networks. (arXiv:2304.05727v3 [cs.LG] UPDATED)
    Robustness has become an important consideration in deep learning. With the help of explainable AI, mismatches between an explained model's decision strategy and the user's domain knowledge (e.g. Clever Hans effects) have been identified as a starting point for improving faulty models. However, it is less clear what to do when the user and the explanation agree. In this paper, we demonstrate that acceptance of explanations by the user is not a guarantee for a machine learning model to be robust against Clever Hans effects, which may remain undetected. Such hidden flaws of the model can nevertheless be mitigated, and we demonstrate this by contributing a new method, Explanation-Guided Exposure Minimization (EGEM), that preemptively prunes variations in the ML model that have not been the subject of positive explanation feedback. Experiments demonstrate that our approach leads to models that strongly reduce their reliance on hidden Clever Hans strategies, and consequently achieve higher accuracy on new data.  ( 2 min )
    Physics-Informed Neural Networks for Time-Domain Simulations: Accuracy, Computational Cost, and Flexibility. (arXiv:2303.08994v2 [eess.SY] UPDATED)
    The simulation of power system dynamics poses a computationally expensive task. Considering the growing uncertainty of generation and demand patterns, thousands of scenarios need to be continuously assessed to ensure the safety of power systems. Physics-Informed Neural Networks (PINNs) have recently emerged as a promising solution for drastically accelerating computations of non-linear dynamical systems. This work investigates the applicability of these methods for power system dynamics, focusing on the dynamic response to load disturbances. Comparing the prediction of PINNs to the solution of conventional solvers, we find that PINNs can be 10 to 1000 times faster than conventional solvers. At the same time, we find them to be sufficiently accurate and numerically stable even for large time steps. To facilitate a deeper understanding, this paper also present a new regularisation of Neural Network (NN) training by introducing a gradient-based term in the loss function. The resulting NNs, which we call dtNNs, help us deliver a comprehensive analysis about the strengths and weaknesses of the NN based approaches, how incorporating knowledge of the underlying physics affects NN performance, and how this compares with conventional solvers for power system dynamics.  ( 2 min )
    Low-Multi-Rank High-Order Bayesian Robust Tensor Factorization. (arXiv:2311.05888v1 [cs.LG])
    The recently proposed tensor robust principal component analysis (TRPCA) methods based on tensor singular value decomposition (t-SVD) have achieved numerous successes in many fields. However, most of these methods are only applicable to third-order tensors, whereas the data obtained in practice are often of higher order, such as fourth-order color videos, fourth-order hyperspectral videos, and fifth-order light-field images. Additionally, in the t-SVD framework, the multi-rank of a tensor can describe more fine-grained low-rank structure in the tensor compared with the tubal rank. However, determining the multi-rank of a tensor is a much more difficult problem than determining the tubal rank. Moreover, most of the existing TRPCA methods do not explicitly model the noises except the sparse noise, which may compromise the accuracy of estimating the low-rank tensor. In this work, we propose a novel high-order TRPCA method, named as Low-Multi-rank High-order Bayesian Robust Tensor Factorization (LMH-BRTF), within the Bayesian framework. Specifically, we decompose the observed corrupted tensor into three parts, i.e., the low-rank component, the sparse component, and the noise component. By constructing a low-rank model for the low-rank component based on the order-$d$ t-SVD and introducing a proper prior for the model, LMH-BRTF can automatically determine the tensor multi-rank. Meanwhile, benefiting from the explicit modeling of both the sparse and noise components, the proposed method can leverage information from the noises more effectivly, leading to an improved performance of TRPCA. Then, an efficient variational inference algorithm is established for parameters estimation. Empirical studies on synthetic and real-world datasets demonstrate the effectiveness of the proposed method in terms of both qualitative and quantitative results.  ( 3 min )
    Transformers learn to implement preconditioned gradient descent for in-context learning. (arXiv:2306.00297v2 [cs.LG] UPDATED)
    Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent. Going beyond the question of expressivity, we ask: Can transformers learn to implement such algorithms by training over random problem instances? To our knowledge, we make the first theoretical progress on this question via an analysis of the loss landscape for linear transformers trained over random instances of linear regression. For a single attention layer, we prove the global minimum of the training objective implements a single iteration of preconditioned gradient descent. Notably, the preconditioning matrix not only adapts to the input distribution but also to the variance induced by data inadequacy. For a transformer with $L$ attention layers, we prove certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent. Our results call for future theoretical studies on learning algorithms by training transformers.  ( 2 min )
    Learning with Exposure Constraints in Recommendation Systems. (arXiv:2302.01377v2 [cs.LG] UPDATED)
    Recommendation systems are dynamic economic systems that balance the needs of multiple stakeholders. A recent line of work studies incentives from the content providers' point of view. Content providers, e.g., vloggers and bloggers, contribute fresh content and rely on user engagement to create revenue and finance their operations. In this work, we propose a contextual multi-armed bandit setting to model the dependency of content providers on exposure. In our model, the system receives a user context in every round and has to select one of the arms. Every arm is a content provider who must receive a minimum number of pulls every fixed time period (e.g., a month) to remain viable in later rounds; otherwise, the arm departs and is no longer available. The system aims to maximize the users' (content consumers) welfare. To that end, it should learn which arms are vital and ensure they remain viable by subsidizing arm pulls if needed. We develop algorithms with sub-linear regret, as well as a lower bound that demonstrates that our algorithms are optimal up to logarithmic factors.  ( 2 min )
    Machine Learning-powered Compact Modeling of Stochastic Electronic Devices using Mixture Density Networks. (arXiv:2311.05820v1 [cs.LG])
    The relentless pursuit of miniaturization and performance enhancement in electronic devices has led to a fundamental challenge in the field of circuit design and simulation: how to accurately account for the inherent stochastic nature of certain devices. While conventional deterministic models have served as indispensable tools for circuit designers, they fall short when it comes to capture the subtle yet critical variability exhibited by many electronic components. In this paper, we present an innovative approach that transcends the limitations of traditional modeling techniques by harnessing the power of machine learning, specifically Mixture Density Networks (MDNs), to faithfully represent and simulate the stochastic behavior of electronic devices. We demonstrate our approach to model heater cryotrons, where the model is able to capture the stochastic switching dynamics observed in the experiment. Our model shows 0.82% mean absolute error for switching probability. This paper marks a significant step forward in the quest for accurate and versatile compact models, poised to drive innovation in the realm of electronic circuits.  ( 2 min )
    Minimum norm interpolation by perceptra: Explicit regularization and implicit bias. (arXiv:2311.06138v1 [stat.ML])
    We investigate how shallow ReLU networks interpolate between known regions. Our analysis shows that empirical risk minimizers converge to a minimum norm interpolant as the number of data points and parameters tends to infinity when a weight decay regularizer is penalized with a coefficient which vanishes at a precise rate as the network width and the number of data points grow. With and without explicit regularization, we numerically study the implicit bias of common optimization algorithms towards known minimum norm interpolants.  ( 2 min )
    A Semi-Bayesian Nonparametric Estimator of the Maximum Mean Discrepancy Measure: Applications in Goodness-of-Fit Testing and Generative Adversarial Networks. (arXiv:2303.02637v2 [stat.ML] UPDATED)
    A classic inferential statistical problem is the goodness-of-fit (GOF) test. Such a test can be challenging when the hypothesized parametric model has an intractable likelihood and its distributional form is not available. Bayesian methods for GOF can be appealing due to their ability to incorporate expert knowledge through prior distributions. However, standard Bayesian methods for this test often require strong distributional assumptions on the data and their relevant parameters. To address this issue, we propose a semi-Bayesian nonparametric (semi-BNP) procedure in the context of the maximum mean discrepancy (MMD) measure that can be applied to the GOF test. Our method introduces a novel Bayesian estimator for the MMD, enabling the development of a measure-based hypothesis test for intractable models. Through extensive experiments, we demonstrate that our proposed test outperforms frequentist MMD-based methods by achieving a lower false rejection and acceptance rate of the null hypothesis. Furthermore, we showcase the versatility of our approach by embedding the proposed estimator within a generative adversarial network (GAN) framework. It facilitates a robust BNP learning approach as another significant application of our method. With our BNP procedure, this new GAN approach can enhance sample diversity and improve inferential accuracy compared to traditional techniques.  ( 3 min )
    PRIOR: Personalized Prior for Reactivating the Information Overlooked in Federated Learning. (arXiv:2310.09183v2 [cs.LG] UPDATED)
    Classical federated learning (FL) enables training machine learning models without sharing data for privacy preservation, but heterogeneous data characteristic degrades the performance of the localized model. Personalized FL (PFL) addresses this by synthesizing personalized models from a global model via training on local data. Such a global model may overlook the specific information that the clients have been sampled. In this paper, we propose a novel scheme to inject personalized prior knowledge into the global model in each client, which attempts to mitigate the introduced incomplete information problem in PFL. At the heart of our proposed approach is a framework, the PFL with Bregman Divergence (pFedBreD), decoupling the personalized prior from the local objective function regularized by Bregman divergence for greater adaptability in personalized scenarios. We also relax the mirror descent (RMD) to extract the prior explicitly to provide optional strategies. Additionally, our pFedBreD is backed up by a convergence analysis. Sufficient experiments demonstrate that our method reaches the state-of-the-art performances on 5 datasets and outperforms other methods by up to 3.5% across 8 benchmarks. Extensive analyses verify the robustness and necessity of proposed designs.  ( 2 min )
    Dual input stream transformer for eye-tracking line assignment. (arXiv:2311.06095v1 [cs.CV])
    We introduce a novel Dual Input Stream Transformer (DIST) for the challenging problem of assigning fixation points from eye-tracking data collected during passage reading to the line of text that the reader was actually focused on. This post-processing step is crucial for analysis of the reading data due to the presence of noise in the form of vertical drift. We evaluate DIST against nine classical approaches on a comprehensive suite of nine diverse datasets, and demonstrate DIST's superiority. By combining multiple instances of the DIST model in an ensemble we achieve an average accuracy of 98.5\% across all datasets. Our approach presents a significant step towards addressing the bottleneck of manual line assignment in reading research. Through extensive model analysis and ablation studies, we identify key factors that contribute to DIST's success, including the incorporation of line overlap features and the use of a second input stream. Through evaluation on a set of diverse datasets we demonstrate that DIST is robust to various experimental setups, making it a safe first choice for practitioners in the field.  ( 2 min )
    Exploring the Efficacy of Base Data Augmentation Methods in Deep Learning-Based Radiograph Classification of Knee Joint Osteoarthritis. (arXiv:2311.06118v1 [eess.IV])
    Diagnosing knee joint osteoarthritis (KOA), a major cause of disability worldwide, is challenging due to subtle radiographic indicators and the varied progression of the disease. Using deep learning for KOA diagnosis requires broad, comprehensive datasets. However, obtaining these datasets poses significant challenges due to patient privacy concerns and data collection restrictions. Additive data augmentation, which enhances data variability, emerges as a promising solution. Yet, it's unclear which augmentation techniques are most effective for KOA. This study explored various data augmentation methods, including adversarial augmentations, and their impact on KOA classification model performance. While some techniques improved performance, others commonly used underperformed. We identified potential confounding regions within the images using adversarial augmentation. This was evidenced by our models' ability to classify KL0 and KL4 grades accurately, with the knee joint omitted. This observation suggested a model bias, which might leverage unrelated features for classification currently present in radiographs. Interestingly, removing the knee joint also led to an unexpected improvement in KL1 classification accuracy. To better visualize these paradoxical effects, we employed Grad-CAM, highlighting the associated regions. Our study underscores the need for careful technique selection for improved model performance and identifying and managing potential confounding regions in radiographic KOA deep learning.  ( 3 min )
    An Experimental Design for Anytime-Valid Causal Inference on Multi-Armed Bandits. (arXiv:2311.05794v1 [stat.ME])
    Typically, multi-armed bandit (MAB) experiments are analyzed at the end of the study and thus require the analyst to specify a fixed sample size in advance. However, in many online learning applications, it is advantageous to continuously produce inference on the average treatment effect (ATE) between arms as new data arrive and determine a data-driven stopping time for the experiment. Existing work on continuous inference for adaptive experiments assumes that the treatment assignment probabilities are bounded away from zero and one, thus excluding nearly all standard bandit algorithms. In this work, we develop the Mixture Adaptive Design (MAD), a new experimental design for multi-armed bandits that enables continuous inference on the ATE with guarantees on statistical validity and power for nearly any bandit algorithm. On a high level, the MAD "mixes" a bandit algorithm of the user's choice with a Bernoulli design through a tuning parameter $\delta_t$, where $\delta_t$ is a deterministic sequence that controls the priority placed on the Bernoulli design as the sample size grows. We show that for $\delta_t = o\left(1/t^{1/4}\right)$, the MAD produces a confidence sequence that is asymptotically valid and guaranteed to shrink around the true ATE. We empirically show that the MAD improves the coverage and power of ATE inference in MAB experiments without significant losses in finite-sample reward.  ( 2 min )
    TAKDE: Temporal Adaptive Kernel Density Estimator for Real-Time Dynamic Density Estimation. (arXiv:2203.08317v2 [stat.ML] UPDATED)
    Real-time density estimation is ubiquitous in many applications, including computer vision and signal processing. Kernel density estimation is arguably one of the most commonly used density estimation techniques, and the use of "sliding window" mechanism adapts kernel density estimators to dynamic processes. In this paper, we derive the asymptotic mean integrated squared error (AMISE) upper bound for the "sliding window" kernel density estimator. This upper bound provides a principled guide to devise a novel estimator, which we name the temporal adaptive kernel density estimator (TAKDE). Compared to heuristic approaches for "sliding window" kernel density estimator, TAKDE is theoretically optimal in terms of the worst-case AMISE. We provide numerical experiments using synthetic and real-world datasets, showing that TAKDE outperforms other state-of-the-art dynamic density estimators (including those outside of kernel family). In particular, TAKDE achieves a superior test log-likelihood with a smaller runtime.  ( 2 min )
    Multi-Task Imitation Learning for Linear Dynamical Systems. (arXiv:2212.00186v3 [cs.LG] UPDATED)
    We study representation learning for efficient imitation learning over linear systems. In particular, we consider a setting where learning is split into two phases: (a) a pre-training step where a shared $k$-dimensional representation is learned from $H$ source policies, and (b) a target policy fine-tuning step where the learned representation is used to parameterize the policy class. We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left( \frac{k n_x}{HN_{\mathrm{shared}}} + \frac{k n_u}{N_{\mathrm{target}}}\right)$, where $n_x > k$ is the state dimension, $n_u$ is the input dimension, $N_{\mathrm{shared}}$ denotes the total amount of data collected for each policy during representation learning, and $N_{\mathrm{target}}$ is the amount of target task data. This result formalizes the intuition that aggregating data across related tasks to learn a representation can significantly improve the sample efficiency of learning a target task. The trends suggested by this bound are corroborated in simulation.  ( 2 min )
    Distributionally Robust Skeleton Learning of Discrete Bayesian Networks. (arXiv:2311.06117v1 [cs.LG])
    We consider the problem of learning the exact skeleton of general discrete Bayesian networks from potentially corrupted data. Building on distributionally robust optimization and a regression approach, we propose to optimize the most adverse risk over a family of distributions within bounded Wasserstein distance or KL divergence to the empirical distribution. The worst-case risk accounts for the effect of outliers. The proposed approach applies for general categorical random variables without assuming faithfulness, an ordinal relationship or a specific form of conditional distribution. We present efficient algorithms and show the proposed methods are closely related to the standard regularized regression approach. Under mild assumptions, we derive non-asymptotic guarantees for successful structure learning with logarithmic sample complexities for bounded-degree graphs. Numerical study on synthetic and real datasets validates the effectiveness of our method. Code is available at https://github.com/DanielLeee/drslbn.  ( 2 min )
    Forecasting Response to Treatment with Global Deep Learning and Patient-Specific Pharmacokinetic Priors. (arXiv:2309.13135v5 [cs.LG] UPDATED)
    Forecasting healthcare time series is crucial for early detection of adverse outcomes and for patient monitoring. Forecasting, however, can be difficult in practice due to noisy and intermittent data. The challenges are often exacerbated by change points induced via extrinsic factors, such as the administration of medication. To address these challenges, we propose a novel hybrid global-local architecture and a pharmacokinetic encoder that informs deep learning models of patient-specific treatment effects. We showcase the efficacy of our approach in achieving significant accuracy gains for a blood glucose forecasting task using both realistically simulated and real-world data. Our global-local architecture improves over patient-specific models by 9.2-14.6%. Additionally, our pharmacokinetic encoder improves over alternative encoding techniques by 4.4% on simulated data and 2.1% on real-world data. The proposed approach can have multiple beneficial applications in clinical practice, such as issuing early warnings about unexpected treatment responses, or helping to characterize patient-specific treatment effects in terms of drug absorption and elimination characteristics.  ( 2 min )
    A Safe Deep Reinforcement Learning Approach for Energy Efficient Federated Learning in Wireless Communication Networks. (arXiv:2308.10664v2 [cs.LG] UPDATED)
    Progressing towards a new era of Artificial Intelligence (AI) - enabled wireless networks, concerns regarding the environmental impact of AI have been raised both in industry and academia. Federated Learning (FL) has emerged as a key privacy preserving decentralized AI technique. Despite efforts currently being made in FL, its environmental impact is still an open problem. Targeting the minimization of the overall energy consumption of an FL process, we propose the orchestration of computational and communication resources of the involved devices to minimize the total energy required, while guaranteeing a certain performance of the model. To this end, we propose a Soft Actor Critic Deep Reinforcement Learning (DRL) solution, where a penalty function is introduced during training, penalizing the strategies that violate the constraints of the environment, and contributing towards a safe RL process. A device level synchronization method, along with a computationally cost effective FL environment are proposed, with the goal of further reducing the energy consumption and communication overhead. Evaluation results show the effectiveness and robustness of the proposed scheme compared to four state-of-the-art baseline solutions on different network environments and FL architectures, achieving a decrease of up to 94% in the total energy consumption.  ( 3 min )
    ID Embedding as Subtle Features of Content and Structure for Multimodal Recommendation. (arXiv:2311.05956v1 [cs.IR])
    Multimodal recommendation aims to model user and item representations comprehensively with the involvement of multimedia content for effective recommendations. Existing research has shown that it is beneficial for recommendation performance to combine (user- and item-) ID embeddings with multimodal salient features, indicating the value of IDs. However, there is a lack of a thorough analysis of the ID embeddings in terms of feature semantics in the literature. In this paper, we revisit the value of ID embeddings for multimodal recommendation and conduct a thorough study regarding its semantics, which we recognize as subtle features of content and structures. Then, we propose a novel recommendation model by incorporating ID embeddings to enhance the semantic features of both content and structures. Specifically, we put forward a hierarchical attention mechanism to incorporate ID embeddings in modality fusing, coupled with contrastive learning, to enhance content representations. Meanwhile, we propose a lightweight graph convolutional network for each modality to amalgamate neighborhood and ID embeddings for improving structural representations. Finally, the content and structure representations are combined to form the ultimate item embedding for recommendation. Extensive experiments on three real-world datasets (Baby, Sports, and Clothing) demonstrate the superiority of our method over state-of-the-art multimodal recommendation methods and the effectiveness of fine-grained ID embeddings.  ( 2 min )
    Are "Hierarchical" Visual Representations Hierarchical?. (arXiv:2311.05784v1 [cs.CV])
    Learned visual representations often capture large amounts of semantic information for accurate downstream applications. Human understanding of the world is fundamentally grounded in hierarchy. To mimic this and further improve representation capabilities, the community has explored "hierarchical" visual representations that aim at modeling the underlying hierarchy of the visual world. In this work, we set out to investigate if hierarchical visual representations truly capture the human perceived hierarchy better than standard learned representations. To this end, we create HierNet, a suite of 12 datasets spanning 3 kinds of hierarchy from the BREEDs subset of ImageNet. After extensive evaluation of Hyperbolic and Matryoshka Representations across training setups, we conclude that they do not capture hierarchy any better than the standard representations but can assist in other aspects like search efficiency and interpretability. Our benchmark and the datasets are open-sourced at https://github.com/ethanlshen/HierNet.  ( 2 min )
    Does Differential Privacy Prevent Backdoor Attacks in Practice?. (arXiv:2311.06227v1 [cs.CR])
    Differential Privacy (DP) was originally developed to protect privacy. However, it has recently been utilized to secure machine learning (ML) models from poisoning attacks, with DP-SGD receiving substantial attention. Nevertheless, a thorough investigation is required to assess the effectiveness of different DP techniques in preventing backdoor attacks in practice. In this paper, we investigate the effectiveness of DP-SGD and, for the first time in literature, examine PATE in the context of backdoor attacks. We also explore the role of different components of DP algorithms in defending against backdoor attacks and will show that PATE is effective against these attacks due to the bagging structure of the teacher models it employs. Our experiments reveal that hyperparameters and the number of backdoors in the training dataset impact the success of DP algorithms. Additionally, we propose Label-DP as a faster and more accurate alternative to DP-SGD and PATE. We conclude that while Label-DP algorithms generally offer weaker privacy protection, accurate hyper-parameter tuning can make them more effective than DP methods in defending against backdoor attacks while maintaining model accuracy.  ( 2 min )
    Efficient Semi-Supervised Federated Learning for Heterogeneous Participants. (arXiv:2307.15870v2 [cs.LG] UPDATED)
    Federated Learning (FL) has emerged to allow multiple clients to collaboratively train machine learning models on their private data. However, training and deploying large-scale models on resource-constrained clients is challenging. Fortunately, Split Federated Learning (SFL) offers a feasible solution by alleviating the computation and/or communication burden on clients. However, existing SFL works often assume sufficient labeled data on clients, which is usually impractical. Besides, data non-IIDness across clients poses another challenge to ensure efficient model training. To our best knowledge, the above two issues have not been simultaneously addressed in SFL. Herein, we propose a novel Semi-SFL system, which incorporates clustering regularization to perform SFL under the more practical scenario with unlabeled and non-IID client data. Moreover, our theoretical and experimental investigations into model convergence reveal that the inconsistent training processes on labeled and unlabeled data have an influence on the effectiveness of clustering regularization. To this end, we develop a control algorithm for dynamically adjusting the global updating frequency, so as to mitigate the training inconsistency and improve training performance. Extensive experiments on benchmark models and datasets show that our system provides a 3.0x speed-up in training time and reduces the communication cost by about 70.3% while reaching the target accuracy, and achieves up to 5.1% improvement in accuracy under non-IID scenarios compared to the state-of-the-art baselines.  ( 2 min )
    Anytime-Valid Confidence Sequences for Consistent Uncertainty Estimation in Early-Exit Neural Networks. (arXiv:2311.05931v1 [cs.LG])
    Early-exit neural networks (EENNs) facilitate adaptive inference by producing predictions at multiple stages of the forward pass. In safety-critical applications, these predictions are only meaningful when complemented with reliable uncertainty estimates. Yet, due to their sequential structure, an EENN's uncertainty estimates should also be consistent: labels that are deemed improbable at one exit should not reappear within the confidence interval / set of later exits. We show that standard uncertainty quantification techniques, like Bayesian methods or conformal prediction, can lead to inconsistency across exits. We address this problem by applying anytime-valid confidence sequences (AVCSs) to the exits of EENNs. By design, AVCSs maintain consistency across exits. We examine the theoretical and practical challenges of applying AVCSs to EENNs and empirically validate our approach on both regression and classification tasks.  ( 2 min )
    An Interpretable Machine Learning Framework to Understand Bikeshare Demand before and during the COVID-19 Pandemic in New York City. (arXiv:2311.06110v1 [cs.LG])
    In recent years, bikesharing systems have become increasingly popular as affordable and sustainable micromobility solutions. Advanced mathematical models such as machine learning are required to generate good forecasts for bikeshare demand. To this end, this study proposes a machine learning modeling framework to estimate hourly demand in a large-scale bikesharing system. Two Extreme Gradient Boosting models were developed: one using data from before the COVID-19 pandemic (March 2019 to February 2020) and the other using data from during the pandemic (March 2020 to February 2021). Furthermore, a model interpretation framework based on SHapley Additive exPlanations was implemented. Based on the relative importance of the explanatory variables considered in this study, share of female users and hour of day were the two most important explanatory variables in both models. However, the month variable had higher importance in the pandemic model than in the pre-pandemic model.
    In-Context Learning for MIMO Equalization Using Transformer-Based Sequence Models. (arXiv:2311.06101v1 [cs.IT])
    Large pre-trained sequence models, such as transformer-based architectures, have been recently shown to have the capacity to carry out in-context learning (ICL). In ICL, a decision on a new input is made via a direct mapping of the input and of a few examples from the given task, serving as the task's context, to the output variable. No explicit updates of model parameters are needed to tailor the decision to a new task. Pre-training, which amounts to a form of meta-learning, is based on the observation of examples from several related tasks. Prior work has shown ICL capabilities for linear regression. In this study, we leverage ICL to address the inverse problem of multiple-input and multiple-output (MIMO) equalization based on a context given by pilot symbols. A task is defined by the unknown fading channel and by the signal-to-noise ratio (SNR) level, which may be known. To highlight the practical potential of the approach, we allow for the presence of quantization of the received signals. We demonstrate via numerical results that transformer-based ICL has a threshold behavior, whereby, as the number of pre-training tasks grows, the performance switches from that of a minimum mean squared error (MMSE) equalizer with a prior determined by the pre-trained tasks to that of an MMSE equalizer with the true data-generating prior.
    Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers. (arXiv:2305.17328v2 [cs.CV] UPDATED)
    Deployment of Transformer models on edge devices is becoming increasingly challenging due to the exponentially growing inference cost that scales quadratically with the number of tokens in the input sequence. Token pruning is an emerging solution to address this challenge due to its ease of deployment on various Transformer backbones. However, most token pruning methods require computationally expensive fine-tuning, which is undesirable in many edge deployment cases. In this work, we propose Zero-TPrune, the first zero-shot method that considers both the importance and similarity of tokens in performing token pruning. It leverages the attention graph of pre-trained Transformer models to produce an importance distribution for tokens via our proposed Weighted Page Rank (WPR) algorithm. This distribution further guides token partitioning for efficient similarity-based pruning. Due to the elimination of the fine-tuning overhead, Zero-TPrune can prune large models at negligible computational cost, switch between different pruning configurations at no computational cost, and perform hyperparameter tuning efficiently. We evaluate the performance of Zero-TPrune on vision tasks by applying it to various vision Transformer backbones and testing them on ImageNet. Without any fine-tuning, Zero-TPrune reduces the FLOPs cost of DeiT-S by 34.7\% and improves its throughput by 45.3\% with only 0.4\% accuracy loss. Compared with state-of-the-art pruning methods that require fine-tuning, Zero-TPrune not only eliminates the need for fine-tuning after pruning but also does so with only 0.1\% accuracy loss. Compared with state-of-the-art fine-tuning-free pruning methods, Zero-TPrune reduces accuracy loss by up to 49\% with the same or higher throughput.
    Greedy PIG: Adaptive Integrated Gradients. (arXiv:2311.06192v1 [cs.LG])
    Deep learning has become the standard approach for most machine learning tasks. While its impact is undeniable, interpreting the predictions of deep learning models from a human perspective remains a challenge. In contrast to model training, model interpretability is harder to quantify and pose as an explicit optimization problem. Inspired by the AUC softmax information curve (AUC SIC) metric for evaluating feature attribution methods, we propose a unified discrete optimization framework for feature attribution and feature selection based on subset selection. This leads to a natural adaptive generalization of the path integrated gradients (PIG) method for feature attribution, which we call Greedy PIG. We demonstrate the success of Greedy PIG on a wide variety of tasks, including image feature attribution, graph compression/explanation, and post-hoc feature selection on tabular data. Our results show that introducing adaptivity is a powerful and versatile method for making attribution methods more powerful.
    Uncertainty-aware Single View Volumetric Rendering for Medical Neural Radiance Fields. (arXiv:2311.05836v1 [eess.IV])
    In the field of clinical medicine, computed tomography (CT) is an effective medical imaging modality for the diagnosis of various pathologies. Compared with X-ray images, CT images can provide more information, including multi-planar slices and three-dimensional structures for clinical diagnosis. However, CT imaging requires patients to be exposed to large doses of ionizing radiation for a long time, which may cause irreversible physical harm. In this paper, we propose an Uncertainty-aware MedNeRF (UMedNeRF) network based on generated radiation fields. The network can learn a continuous representation of CT projections from 2D X-ray images by obtaining the internal structure and depth information and using adaptive loss weights to ensure the quality of the generated images. Our model is trained on publicly available knee and chest datasets, and we show the results of CT projection rendering with a single X-ray and compare our method with other methods based on generated radiation fields.
    MultiIoT: Towards Large-scale Multisensory Learning for the Internet of Things. (arXiv:2311.06217v1 [cs.LG])
    The Internet of Things (IoT), the network integrating billions of smart physical devices embedded with sensors, software, and communication technologies for the purpose of connecting and exchanging data with other devices and systems, is a critical and rapidly expanding component of our modern world. The IoT ecosystem provides a rich source of real-world modalities such as motion, thermal, geolocation, imaging, depth, sensors, video, and audio for prediction tasks involving the pose, gaze, activities, and gestures of humans as well as the touch, contact, pose, 3D of physical objects. Machine learning presents a rich opportunity to automatically process IoT data at scale, enabling efficient inference for impact in understanding human wellbeing, controlling physical devices, and interconnecting smart cities. To develop machine learning technologies for IoT, this paper proposes MultiIoT, the most expansive IoT benchmark to date, encompassing over 1.15 million samples from 12 modalities and 8 tasks. MultiIoT introduces unique challenges involving (1) learning from many sensory modalities, (2) fine-grained interactions across long temporal ranges, and (3) extreme heterogeneity due to unique structure and noise topologies in real-world sensors. We also release a set of strong modeling baselines, spanning modality and task-specific methods to multisensory and multitask models to encourage future research in multisensory representation learning for IoT.
    1-Lipschitz Neural Networks are more expressive with N-Activations. (arXiv:2311.06103v1 [cs.LG])
    A crucial property for achieving secure, trustworthy and interpretable deep learning systems is their robustness: small changes to a system's inputs should not result in large changes to its outputs. Mathematically, this means one strives for networks with a small Lipschitz constant. Several recent works have focused on how to construct such Lipschitz networks, typically by imposing constraints on the weight matrices. In this work, we study an orthogonal aspect, namely the role of the activation function. We show that commonly used activation functions, such as MaxMin, as well as all piece-wise linear ones with two segments unnecessarily restrict the class of representable functions, even in the simplest one-dimensional setting. We furthermore introduce the new N-activation function that is provably more expressive than currently popular activation functions. We provide code at https://github.com/berndprach/NActivation.
    Going beyond persistent homology using persistent homology. (arXiv:2311.06152v1 [cs.LG])
    Representational limits of message-passing graph neural networks (MP-GNNs), e.g., in terms of the Weisfeiler-Leman (WL) test for isomorphism, are well understood. Augmenting these graph models with topological features via persistent homology (PH) has gained prominence, but identifying the class of attributed graphs that PH can recognize remains open. We introduce a novel concept of color-separating sets to provide a complete resolution to this important problem. Specifically, we establish the necessary and sufficient conditions for distinguishing graphs based on the persistence of their connected components, obtained from filter functions on vertex and edge colors. Our constructions expose the limits of vertex- and edge-level PH, proving that neither category subsumes the other. Leveraging these theoretical insights, we propose RePHINE for learning topological features on graphs. RePHINE efficiently combines vertex- and edge-level PH, achieving a scheme that is provably more powerful than both. Integrating RePHINE into MP-GNNs boosts their expressive power, resulting in gains over standard PH on several benchmarks for graph classification.
    Fair Supervised Learning with A Simple Random Sampler of Sensitive Attributes. (arXiv:2311.05866v1 [stat.ML])
    As the data-driven decision process becomes dominating for industrial applications, fairness-aware machine learning arouses great attention in various areas. This work proposes fairness penalties learned by neural networks with a simple random sampler of sensitive attributes for non-discriminatory supervised learning. In contrast to many existing works that critically rely on the discreteness of sensitive attributes and response variables, the proposed penalty is able to handle versatile formats of the sensitive attributes, so it is more extensively applicable in practice than many existing algorithms. This penalty enables us to build a computationally efficient group-level in-processing fairness-aware training framework. Empirical evidence shows that our framework enjoys better utility and fairness measures on popular benchmark data sets than competing methods. We also theoretically characterize estimation errors and loss of utility of the proposed neural-penalized risk minimization problem.
    FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores. (arXiv:2311.05908v1 [cs.LG])
    Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time. A major bottleneck is the Fast Fourier Transform (FFT)--which allows long convolutions to run in $O(N logN)$ time in sequence length $N$ but has poor hardware utilization. In this paper, we study how to optimize the FFT convolution. We find two key bottlenecks: the FFT does not effectively use specialized matrix multiply units, and it incurs expensive I/O between layers of the memory hierarchy. In response, we propose FlashFFTConv. FlashFFTConv uses a matrix decomposition that computes the FFT using matrix multiply units and enables kernel fusion for long sequences, reducing I/O. We also present two sparse convolution algorithms--1) partial convolutions and 2) frequency-sparse convolutions--which can be implemented simply by skipping blocks in the matrix decomposition, enabling further opportunities for memory and compute savings. FlashFFTConv speeds up exact FFT convolutions by up to 7.93$\times$ over PyTorch and achieves up to 4.4$\times$ speedup end-to-end. Given the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity on the PILE and M2-BERT-base to achieve 3.3 points higher GLUE score--matching models with twice the parameter count. FlashFFTConv also achieves 96.1% accuracy on Path-512, a high-resolution vision task where no model had previously achieved better than 50%. Furthermore, partial convolutions enable longer-sequence models--yielding the first DNA model that can process the longest human genes (2.3M base pairs)--and frequency-sparse convolutions speed up pretrained models while maintaining or improving model quality.
    Multiscale Neural Operators for Solving Time-Independent PDEs. (arXiv:2311.05964v1 [cs.LG])
    Time-independent Partial Differential Equations (PDEs) on large meshes pose significant challenges for data-driven neural PDE solvers. We introduce a novel graph rewiring technique to tackle some of these challenges, such as aggregating information across scales and on irregular meshes. Our proposed approach bridges distant nodes, enhancing the global interaction capabilities of GNNs. Our experiments on three datasets reveal that GNN-based methods set new performance standards for time-independent PDEs on irregular meshes. Finally, we show that our graph rewiring strategy boosts the performance of baseline methods, achieving state-of-the-art results in one of the tasks.
    Sum-max Submodular Bandits. (arXiv:2311.05975v1 [cs.LG])
    Many online decision-making problems correspond to maximizing a sequence of submodular functions. In this work, we introduce sum-max functions, a subclass of monotone submodular functions capturing several interesting problems, including best-of-$K$-bandits, combinatorial bandits, and the bandit versions on facility location, $M$-medians, and hitting sets. We show that all functions in this class satisfy a key property that we call pseudo-concavity. This allows us to prove $\big(1 - \frac{1}{e}\big)$-regret bounds for bandit feedback in the nonstochastic setting of the order of $\sqrt{MKT}$ (ignoring log factors), where $T$ is the time horizon and $M$ is a cardinality constraint. This bound, attained by a simple and efficient algorithm, significantly improves on the $\widetilde{O}\big(T^{2/3}\big)$ regret bound for online monotone submodular maximization with bandit feedback.
    Optimal Cooperative Multiplayer Learning Bandits with Noisy Rewards and No Communication. (arXiv:2311.06210v1 [cs.LG])
    We consider a cooperative multiplayer bandit learning problem where the players are only allowed to agree on a strategy beforehand, but cannot communicate during the learning process. In this problem, each player simultaneously selects an action. Based on the actions selected by all players, the team of players receives a reward. The actions of all the players are commonly observed. However, each player receives a noisy version of the reward which cannot be shared with other players. Since players receive potentially different rewards, there is an asymmetry in the information used to select their actions. In this paper, we provide an algorithm based on upper and lower confidence bounds that the players can use to select their optimal actions despite the asymmetry in the reward information. We show that this algorithm can achieve logarithmic $O(\frac{\log T}{\Delta_{\bm{a}}})$ (gap-dependent) regret as well as $O(\sqrt{T\log T})$ (gap-independent) regret. This is asymptotically optimal in $T$. We also show that it performs empirically better than the current state of the art algorithm for this environment.
    Learning-Augmented Scheduling for Solar-Powered Electric Vehicle Charging. (arXiv:2311.05941v1 [eess.SY])
    We tackle the complex challenge of scheduling the charging of electric vehicles (EVs) equipped with solar panels and batteries, particularly under out-of-distribution (OOD) conditions. Traditional scheduling approaches, such as reinforcement learning (RL) and model predictive control (MPC), often fail to provide satisfactory results when faced with OOD data, struggling to balance robustness (worst-case performance) and consistency (near-optimal average performance). To address this gap, we introduce a novel learning-augmented policy. This policy employs a dynamic robustness budget, which is adapted in real-time based on the reinforcement learning policy's performance. Specifically, it leverages the temporal difference (TD) error, a measure of the learning policy's prediction accuracy, to assess the trustworthiness of the machine-learned policy. This method allows for a more effective balance between consistency and robustness in EV charging schedules, significantly enhancing adaptability and efficiency in real-world, unpredictable environments. Our results demonstrate that this approach markedly improves scheduling effectiveness and reliability, particularly in OOD contexts, paving the way for more resilient and adaptive EV charging systems.
    Symbolic Regression as Feature Engineering Method for Machine and Deep Learning Regression Tasks. (arXiv:2311.06028v1 [cs.LG])
    In the realm of machine and deep learning regression tasks, the role of effective feature engineering (FE) is pivotal in enhancing model performance. Traditional approaches of FE often rely on domain expertise to manually design features for machine learning models. In the context of deep learning models, the FE is embedded in the neural network's architecture, making it hard for interpretation. In this study, we propose to integrate symbolic regression (SR) as an FE process before a machine learning model to improve its performance. We show, through extensive experimentation on synthetic and real-world physics-related datasets, that the incorporation of SR-derived features significantly enhances the predictive capabilities of both machine and deep learning regression models with 34-86% root mean square error (RMSE) improvement in synthetic datasets and 4-11.5% improvement in real-world datasets. In addition, as a realistic use-case, we show the proposed method improves the machine learning performance in predicting superconducting critical temperatures based on Eliashberg theory by more than 20% in terms of RMSE. These results outline the potential of SR as an FE component in data-driven models.
    Federated Learning with Manifold Regularization and Normalized Update Reaggregation. (arXiv:2311.05924v1 [cs.LG])
    Federated Learning (FL) is an emerging collaborative machine learning framework where multiple clients train the global model without sharing their own datasets. In FL, the model inconsistency caused by the local data heterogeneity across clients results in the near-orthogonality of client updates, which leads to the global update norm reduction and slows down the convergence. Most previous works focus on eliminating the difference of parameters (or gradients) between the local and global models, which may fail to reflect the model inconsistency due to the complex structure of the machine learning model and the Euclidean space's limitation in meaningful geometric representations. In this paper, we propose FedMRUR by adopting the manifold model fusion scheme and a new global optimizer to alleviate the negative impacts. Concretely, FedMRUR adopts a hyperbolic graph manifold regularizer enforcing the representations of the data in the local and global models are close to each other in a low-dimensional subspace. Because the machine learning model has the graph structure, the distance in hyperbolic space can reflect the model bias better than the Euclidean distance. In this way, FedMRUR exploits the manifold structures of the representations to significantly reduce the model inconsistency. FedMRUR also aggregates the client updates norms as the global update norm, which can appropriately enlarge each client's contribution to the global update, thereby mitigating the norm reduction introduced by the near-orthogonality of client updates. Furthermore, we theoretically prove that our algorithm can achieve a linear speedup property for non-convex setting under partial client participation.Experiments demonstrate that FedMRUR can achieve a new state-of-the-art (SOTA) accuracy with less communication.
    An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l. (arXiv:2310.07325v3 [cs.LG] UPDATED)
    How do language models deal with the limited bandwidth of the residual stream? Prior work has suggested that some attention heads and MLP layers may perform a "memory management" role. That is, clearing residual stream directions set by earlier layers by reading in information and writing out the negative version. In this work, we present concrete evidence for this phenomenon in a 4-layer transformer. We identify several heads in layer 2 that consistently remove the output of a single layer 0 head. We then verify that this erasure causally depends on the original written direction. We further demonstrate that direct logit attribution (DLA) suggests that writing and erasing heads directly contribute to predictions, when in fact their effects cancel out. Then we present adversarial prompts for which this effect is particularly salient. These findings reveal that memory management can make DLA results misleading. Accordingly, we make concrete recommendations for circuit analysis to prevent interpretability illusions.  ( 2 min )
    An Improved Relaxation for Oracle-Efficient Adversarial Contextual Bandits. (arXiv:2310.19025v2 [cs.LG] UPDATED)
    We present an oracle-efficient relaxation for the adversarial contextual bandits problem, where the contexts are sequentially drawn i.i.d from a known distribution and the cost sequence is chosen by an online adversary. Our algorithm has a regret bound of $O(T^{\frac{2}{3}}(K\log(|\Pi|))^{\frac{1}{3}})$ and makes at most $O(K)$ calls per round to an offline optimization oracle, where $K$ denotes the number of actions, $T$ denotes the number of rounds and $\Pi$ denotes the set of policies. This is the first result to improve the prior best bound of $O((TK)^{\frac{2}{3}}(\log(|\Pi|))^{\frac{1}{3}})$ as obtained by Syrgkanis et al. at NeurIPS 2016, and the first to match the original bound of Langford and Zhang at NeurIPS 2007 which was obtained for the stochastic case.  ( 2 min )
    Independent and Decentralized Learning in Markov Potential Games. (arXiv:2205.14590v6 [cs.LG] UPDATED)
    We propose a multi-agent reinforcement learning dynamics, and analyze its convergence in infinite-horizon discounted Markov potential games. We focus on the independent and decentralized setting, where players do not have knowledge of the game model and cannot coordinate. In each stage, players update their estimate of Q-function that evaluates their total contingent payoff based on the realized one-stage reward in an asynchronous manner. Then, players independently update their policies by incorporating an optimal one-stage deviation strategy based on the estimated Q-function. A key feature of the learning dynamics is that the Q-function estimates are updated at a faster timescale than the policies. We prove that the policies induced by our learning dynamics converge to the set of stationary Nash equilibria in Markov potential games with probability 1. Our results highlight the efficacy of simple learning dynamics in reaching to the set of stationary Nash equilibrium even in environments with minimal information available.  ( 3 min )
    An empirical evaluation of imbalanced data strategies from a practitioner's point of view. (arXiv:1810.07168v2 [cs.LG] UPDATED)
    This paper evaluates six strategies for mitigating imbalanced data: oversampling, undersampling, ensemble methods, specialized algorithms, class weight adjustments, and a no-mitigation approach referred to as the baseline. These strategies were tested on 58 real-life binary imbalanced datasets with imbalance rates ranging from 3 to 120. We conducted a comparative analysis of 10 under-sampling algorithms, 5 over-sampling algorithms, 2 ensemble methods, and 3 specialized algorithms across eight different performance metrics: accuracy, area under the ROC curve (AUC), balanced accuracy, F1-measure, G-mean, Matthew's correlation coefficient, precision, and recall. Additionally, we assessed the six strategies on altered datasets, derived from real-life data, with both low (3) and high (100 or 300) imbalance ratios (IR). The principal finding indicates that the effectiveness of each strategy significantly varies depending on the metric used. The paper also examines a selection of newer algorithms within the categories of specialized algorithms, oversampling, and ensemble methods. The findings suggest that the current hierarchy of best-performing strategies for each metric is unlikely to change with the introduction of newer algorithms.
    HoloNets: Spectral Convolutions do extend to Directed Graphs. (arXiv:2310.02232v2 [cs.LG] UPDATED)
    Within the graph learning community, conventional wisdom dictates that spectral convolutional networks may only be deployed on undirected graphs: Only there could the existence of a well-defined graph Fourier transform be guaranteed, so that information may be translated between spatial- and spectral domains. Here we show this traditional reliance on the graph Fourier transform to be superfluous and -- making use of certain advanced tools from complex analysis and spectral theory -- extend spectral convolutions to directed graphs. We provide a frequency-response interpretation of newly developed filters, investigate the influence of the basis used to express filters and discuss the interplay with characteristic operators on which networks are based. In order to thoroughly test the developed theory, we conduct experiments in real world settings, showcasing that directed spectral convolutional networks provide new state of the art results for heterophilic node classification on many datasets and -- as opposed to baselines -- may be rendered stable to resolution-scale varying topological perturbations.  ( 2 min )
    Improving Robustness and Reliability in Medical Image Classification with Latent-Guided Diffusion and Nested-Ensembles. (arXiv:2310.15952v3 [cs.LG] UPDATED)
    While deep learning models have achieved remarkable success across a range of medical image analysis tasks, deployment of these models in real clinical contexts requires that they be robust to variability in the acquired images. While many methods apply predefined transformations to augment the training data to enhance test-time robustness, these transformations may not ensure the model's robustness to the diverse variability seen in patient images. In this paper, we introduce a novel three-stage approach based on transformers coupled with conditional diffusion models, with the goal of improving model robustness to the kinds of imaging variability commonly encountered in practice without the need for pre-determined data augmentation strategies. To this end, multiple image encoders first learn hierarchical feature representations to build discriminative latent spaces. Next, a reverse diffusion process, guided by the latent code, acts on an informative prior and proposes prediction candidates in a generative manner. Finally, several prediction candidates are aggregated in a bi-level aggregation protocol to produce the final output. Through extensive experiments on medical imaging benchmark datasets, we show that our method improves upon state-of-the-art methods in terms of robustness and confidence calibration. Additionally, we introduce a strategy to quantify the prediction uncertainty at the instance level, increasing their trustworthiness to clinicians using them in clinical practice.  ( 3 min )
    Doubly Robust Structure Identification from Temporal Data. (arXiv:2311.06012v1 [cs.LG])
    Learning the causes of time-series data is a fundamental task in many applications, spanning from finance to earth sciences or bio-medical applications. Common approaches for this task are based on vector auto-regression, and they do not take into account unknown confounding between potential causes. However, in settings with many potential causes and noisy data, these approaches may be substantially biased. Furthermore, potential causes may be correlated in practical applications. Moreover, existing algorithms often do not work with cyclic data. To address these challenges, we propose a new doubly robust method for Structure Identification from Temporal Data ( SITD ). We provide theoretical guarantees, showing that our method asymptotically recovers the true underlying causal structure. Our analysis extends to cases where the potential causes have cycles and they may be confounded. We further perform extensive experiments to showcase the superior performance of our method.
    Optimal Regret Is Achievable with Bounded Approximate Inference Error: An Enhanced Bayesian Upper Confidence Bound Framework. (arXiv:2201.12955v4 [cs.LG] UPDATED)
    Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. However, there is a large discrepancy between the superior practical performance of these approaches and their theoretical justification. Previous research only indicates a negative theoretical result: Thompson sampling could have a worst-case linear regret $\Omega(T)$ with a constant threshold on the inference error measured by one $\alpha$-divergence. To bridge this gap, we propose an Enhanced Bayesian Upper Confidence Bound (EBUCB) framework that can efficiently accommodate bandit problems in the presence of approximate inference. Our theoretical analysis demonstrates that for Bernoulli multi-armed bandits, EBUCB can achieve the optimal regret order $O(\log T)$ if the inference error measured by two different $\alpha$-divergences is less than a constant, regardless of how large this constant is. To our best knowledge, our study provides the first theoretical regret bound that is better than $o(T)$ in the setting of constant approximate inference error. Furthermore, in concordance with the negative results in previous studies, we show that only one bounded $\alpha$-divergence is insufficient to guarantee a sub-linear regret.  ( 3 min )
    Learning Decentralized Linear Quadratic Regulator with $\sqrt{T}$ Regret. (arXiv:2210.08886v2 [math.OC] UPDATED)
    We study the problem of learning decentralized linear quadratic regulator when the system model is unknown a priori. We propose an online learning algorithm that adaptively designs a control policy as new data samples from a single system trajectory become available. Our algorithm design uses a disturbance-feedback representation of state-feedback controllers coupled with online convex optimization with memory and delayed feedback. We show that our controller enjoys an expected regret that scales as $\sqrt{T}$ with the time horizon $T$ for the case of partially nested information pattern. For more general information patterns, the optimal controller is unknown even if the system model is known. In this case, the regret of our controller is shown with respect to a linear sub-optimal controller. We validate our theoretical findings using numerical experiments.  ( 2 min )
    SANSformers: Self-Supervised Forecasting in Electronic Health Records with Attention-Free Models. (arXiv:2108.13672v4 [cs.LG] UPDATED)
    Despite the proven effectiveness of Transformer neural networks across multiple domains, their performance with Electronic Health Records (EHR) can be nuanced. The unique, multidimensional sequential nature of EHR data can sometimes make even simple linear models with carefully engineered features more competitive. Thus, the advantages of Transformers, such as efficient transfer learning and improved scalability are not always fully exploited in EHR applications. Addressing these challenges, we introduce SANSformer, an attention-free sequential model designed with specific inductive biases to cater for the unique characteristics of EHR data. In this work, we aim to forecast the demand for healthcare services, by predicting the number of patient visits to healthcare facilities. The challenge amplifies when dealing with divergent patient subgroups, like those with rare diseases, which are characterized by unique health trajectories and are typically smaller in size. To address this, we employ a self-supervised pretraining strategy, Generative Summary Pretraining (GSP), which predicts future summary statistics based on past health records of a patient. Our models are pretrained on a health registry of nearly one million patients, then fine-tuned for specific subgroup prediction tasks, showcasing the potential to handle the multifaceted nature of EHR data. In evaluation, SANSformer consistently surpasses robust EHR baselines, with our GSP pretraining method notably amplifying model performance, particularly within smaller patient subgroups. Our results illuminate the promising potential of tailored attention-free models and self-supervised pretraining in refining healthcare utilization predictions across various patient demographics.  ( 3 min )
    Semantic Map Guided Synthesis of Wireless Capsule Endoscopy Images using Diffusion Models. (arXiv:2311.05889v1 [eess.IV])
    Wireless capsule endoscopy (WCE) is a non-invasive method for visualizing the gastrointestinal (GI) tract, crucial for diagnosing GI tract diseases. However, interpreting WCE results can be time-consuming and tiring. Existing studies have employed deep neural networks (DNNs) for automatic GI tract lesion detection, but acquiring sufficient training examples, particularly due to privacy concerns, remains a challenge. Public WCE databases lack diversity and quantity. To address this, we propose a novel approach leveraging generative models, specifically the diffusion model (DM), for generating diverse WCE images. Our model incorporates semantic map resulted from visualization scale (VS) engine, enhancing the controllability and diversity of generated images. We evaluate our approach using visual inspection and visual Turing tests, demonstrating its effectiveness in generating realistic and diverse WCE images.
    Primal Estimated Subgradient Solver for SVM for Imbalanced Classification. (arXiv:2206.09311v6 [cs.LG] UPDATED)
    We aim to demonstrate in experiments that our cost sensitive PEGASOS SVM achieves good performance on imbalanced data sets with a Majority to Minority Ratio ranging from 8.6:1 to 130:1 and to ascertain whether the including intercept (bias), regularization and parameters affects performance on our selection of datasets. Although many resort to SMOTE methods, we aim for a less computationally intensive method. We evaluate the performance by examining the learning curves. These curves diagnose whether we overfit or underfit or whether the random sample of data chosen during the process was not random enough or diverse enough in dependent variable class for the algorithm to generalized to unseen examples. We will also see the background of the hyperparameters versus the test and train error in validation curves. We benchmark our PEGASOS Cost-Sensitive SVM's results of Ding's LINEAR SVM DECIDL method. He obtained an ROC-AUC of .5 in one dataset. Our work will extend the work of Ding by incorporating kernels into SVM. We will use Python rather than MATLAB as python has dictionaries for storing mixed data types during multi-parameter cross-validation.  ( 2 min )
    Turbulence Scaling from Deep Learning Diffusion Generative Models. (arXiv:2311.06112v1 [physics.flu-dyn])
    Complex spatial and temporal structures are inherent characteristics of turbulent fluid flows and comprehending them poses a major challenge. This comprehesion necessitates an understanding of the space of turbulent fluid flow configurations. We employ a diffusion-based generative model to learn the distribution of turbulent vorticity profiles and generate snapshots of turbulent solutions to the incompressible Navier-Stokes equations. We consider the inverse cascade in two spatial dimensions and generate diverse turbulent solutions that differ from those in the training dataset. We analyze the statistical scaling properties of the new turbulent profiles, calculate their structure functions, energy power spectrum, velocity probability distribution function and moments of local energy dissipation. All the learnt scaling exponents are consistent with the expected Kolmogorov scaling and have lower errors than the training ones. This agreement with established turbulence characteristics provides strong evidence of the model's capability to capture essential features of real-world turbulence.  ( 2 min )
    Rethinking Semi-Supervised Imbalanced Node Classification from Bias-Variance Decomposition. (arXiv:2310.18765v2 [cs.LG] UPDATED)
    This paper introduces a new approach to address the issue of class imbalance in graph neural networks (GNNs) for learning on graph-structured data. Our approach integrates imbalanced node classification and Bias-Variance Decomposition, establishing a theoretical framework that closely relates data imbalance to model variance. We also leverage graph augmentation technique to estimate the variance, and design a regularization term to alleviate the impact of imbalance. Exhaustive tests are conducted on multiple benchmarks, including naturally imbalanced datasets and public-split class-imbalanced datasets, demonstrating that our approach outperforms state-of-the-art methods in various imbalanced scenarios. This work provides a novel theoretical perspective for addressing the problem of imbalanced node classification in GNNs.  ( 2 min )
    Score Normalization for a Faster Diffusion Exponential Integrator Sampler. (arXiv:2311.00157v2 [cs.LG] UPDATED)
    Recently, Zhang et al. have proposed the Diffusion Exponential Integrator Sampler (DEIS) for fast generation of samples from Diffusion Models. It leverages the semi-linear nature of the probability flow ordinary differential equation (ODE) in order to greatly reduce integration error and improve generation quality at low numbers of function evaluations (NFEs). Key to this approach is the score function reparameterisation, which reduces the integration error incurred from using a fixed score function estimate over each integration step. The original authors use the default parameterisation used by models trained for noise prediction -- multiply the score by the standard deviation of the conditional forward noising distribution. We find that although the mean absolute value of this score parameterisation is close to constant for a large portion of the reverse sampling process, it changes rapidly at the end of sampling. As a simple fix, we propose to instead reparameterise the score (at inference) by dividing it by the average absolute value of previous score estimates at that time step collected from offline high NFE generations. We find that our score normalisation (DEIS-SN) consistently improves FID compared to vanilla DEIS, showing an improvement at 10 NFEs from 6.44 to 5.57 on CIFAR-10 and from 5.9 to 4.95 on LSUN-Church 64x64. Our code is available at https://github.com/mtkresearch/Diffusion-DEIS-SN  ( 3 min )
    What Distributions are Robust to Indiscriminate Poisoning Attacks for Linear Learners?. (arXiv:2307.01073v2 [cs.LG] UPDATED)
    We study indiscriminate poisoning for linear learners where an adversary injects a few crafted examples into the training data with the goal of forcing the induced model to incur higher test error. Inspired by the observation that linear learners on some datasets are able to resist the best known attacks even without any defenses, we further investigate whether datasets can be inherently robust to indiscriminate poisoning attacks for linear learners. For theoretical Gaussian distributions, we rigorously characterize the behavior of an optimal poisoning attack, defined as the poisoning strategy that attains the maximum risk of the induced model at a given poisoning budget. Our results prove that linear learners can indeed be robust to indiscriminate poisoning if the class-wise data distributions are well-separated with low variance and the size of the constraint set containing all permissible poisoning points is also small. These findings largely explain the drastic variation in empirical attack performance of the state-of-the-art poisoning attacks on linear learners across benchmark datasets, making an important initial step towards understanding the underlying reasons some learning tasks are vulnerable to data poisoning attacks.  ( 2 min )
    Efficient Uncertainty Quantification and Reduction for Over-Parameterized Neural Networks. (arXiv:2306.05674v2 [stat.ML] UPDATED)
    Uncertainty quantification (UQ) is important for reliability assessment and enhancement of machine learning models. In deep learning, uncertainties arise not only from data, but also from the training procedure that often injects substantial noises and biases. These hinder the attainment of statistical guarantees and, moreover, impose computational challenges on UQ due to the need for repeated network retraining. Building upon the recent neural tangent kernel theory, we create statistically guaranteed schemes to principally \emph{characterize}, and \emph{remove}, the uncertainty of over-parameterized neural networks with very low computation effort. In particular, our approach, based on what we call a procedural-noise-correcting (PNC) predictor, removes the procedural uncertainty by using only \emph{one} auxiliary network that is trained on a suitably labeled dataset, instead of many retrained networks employed in deep ensembles. Moreover, by combining our PNC predictor with suitable light-computation resampling methods, we build several approaches to construct asymptotically exact-coverage confidence intervals using as low as four trained networks without additional overheads.  ( 2 min )
    Harnessing Synthetic Datasets: The Role of Shape Bias in Deep Neural Network Generalization. (arXiv:2311.06224v1 [cs.CV])
    Recent advancements in deep learning have been primarily driven by the use of large models trained on increasingly vast datasets. While neural scaling laws have emerged to predict network performance given a specific level of computational resources, the growing demand for expansive datasets raises concerns. To address this, a new research direction has emerged, focusing on the creation of synthetic data as a substitute. In this study, we investigate how neural networks exhibit shape bias during training on synthetic datasets, serving as an indicator of the synthetic data quality. Specifically, our findings indicate three key points: (1) Shape bias varies across network architectures and types of supervision, casting doubt on its reliability as a predictor for generalization and its ability to explain differences in model recognition compared to human capabilities. (2) Relying solely on shape bias to estimate generalization is unreliable, as it is entangled with diversity and naturalism. (3) We propose a novel interpretation of shape bias as a tool for estimating the diversity of samples within a dataset. Our research aims to clarify the implications of using synthetic data and its associated shape bias in deep learning, addressing concerns regarding generalization and dataset quality.
    Bag of Tricks for Fully Test-Time Adaptation. (arXiv:2310.02416v2 [cs.LG] UPDATED)
    Fully Test-Time Adaptation (TTA), which aims at adapting models to data drifts, has recently attracted wide interest. Numerous tricks and techniques have been proposed to ensure robust learning on arbitrary streams of unlabeled data. However, assessing the true impact of each individual technique and obtaining a fair comparison still constitutes a significant challenge. To help consolidate the community's knowledge, we present a categorization of selected orthogonal TTA techniques, including small batch normalization, stream rebalancing, reliable sample selection, and network confidence calibration. We meticulously dissect the effect of each approach on different scenarios of interest. Through our analysis, we shed light on trade-offs induced by those techniques between accuracy, the computational power required, and model complexity. We also uncover the synergy that arises when combining techniques and are able to establish new state-of-the-art results.  ( 2 min )
    Parameterized Convex Minorant for Objective Function Approximation in Amortized Optimization. (arXiv:2310.02519v3 [cs.LG] UPDATED)
    Parameterized convex minorant (PCM) method is proposed for the approximation of the objective function in amortized optimization. In the proposed method, the objective function approximator is expressed by the sum of a PCM and a nonnegative gap function, where the objective function approximator is bounded from below by the PCM convex in the optimization variable. The proposed objective function approximator is a universal approximator for continuous functions, and the global minimizer of the PCM attains the global minimum of the objective function approximator. Therefore, the global minimizer of the objective function approximator can be obtained by a single convex optimization. As a realization of the proposed method, extended parameterized log-sum-exp network is proposed by utilizing a parameterized log-sum-exp network as the PCM. Numerical simulation is performed for parameterized non-convex objective function approximation and for learning-based nonlinear model predictive control to demonstrate the performance and characteristics of the proposed method. The simulation results support that the proposed method can be used to learn objective functions and to find a global minimizer reliably and quickly by using convex optimization algorithms.  ( 2 min )
    Wasserstein Auto-Encoders of Merge Trees (and Persistence Diagrams). (arXiv:2307.02509v2 [cs.LG] UPDATED)
    This paper presents a computational framework for the Wasserstein auto-encoding of merge trees (MT-WAE), a novel extension of the classical auto-encoder neural network architecture to the Wasserstein metric space of merge trees. In contrast to traditional auto-encoders which operate on vectorized data, our formulation explicitly manipulates merge trees on their associated metric space at each layer of the network, resulting in superior accuracy and interpretability. Our novel neural network approach can be interpreted as a non-linear generalization of previous linear attempts [79] at merge tree encoding. It also trivially extends to persistence diagrams. Extensive experiments on public ensembles demonstrate the efficiency of our algorithms, with MT-WAE computations in the orders of minutes on average. We show the utility of our contributions in two applications adapted from previous work on merge tree encoding [79]. First, we apply MT-WAE to merge tree compression, by concisely representing them with their coordinates in the final layer of our auto-encoder. Second, we document an application to dimensionality reduction, by exploiting the latent space of our auto-encoder, for the visual analysis of ensemble data. We illustrate the versatility of our framework by introducing two penalty terms, to help preserve in the latent space both the Wasserstein distances between merge trees, as well as their clusters. In both applications, quantitative experiments assess the relevance of our framework. Finally, we provide a C++ implementation that can be used for reproducibility.  ( 3 min )
    Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer. (arXiv:2309.17340v2 [cs.DC] UPDATED)
    Cloud services are omnipresent and critical cloud service failure is a fact of life. In order to retain customers and prevent revenue loss, it is important to provide high reliability guarantees for these services. One way to do this is by predicting outages in advance, which can help in reducing the severity as well as time to recovery. It is difficult to forecast critical failures due to the rarity of these events. Moreover, critical failures are ill-defined in terms of observable data. Our proposed method, Outage-Watch, defines critical service outages as deteriorations in the Quality of Service (QoS) captured by a set of metrics. Outage-Watch detects such outages in advance by using current system state to predict whether the QoS metrics will cross a threshold and initiate an extreme event. A mixture of Gaussian is used to model the distribution of the QoS metrics for flexibility and an extreme event regularizer helps in improving learning in tail of the distribution. An outage is predicted if the probability of any one of the QoS metrics crossing threshold changes significantly. Our evaluation on a real-world SaaS company dataset shows that Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages exhibiting a change in service metrics and reduces the Mean Time To Detection (MTTD) of outages by up to 88% when deployed in an enterprise cloud-service system, demonstrating efficacy of our proposed method.  ( 3 min )
    CausalImages: An R Package for Causal Inference with Earth Observation, Bio-medical, and Social Science Images. (arXiv:2310.00233v3 [cs.LG] UPDATED)
    The causalimages R package enables causal inference with image and image sequence data, providing new tools for integrating novel data sources like satellite and bio-medical imagery into the study of cause and effect. One set of functions enables image-based causal inference analyses. For example, one key function decomposes treatment effect heterogeneity by images using an interpretable Bayesian framework. This allows for determining which types of images or image sequences are most responsive to interventions. A second modeling function allows researchers to control for confounding using images. The package also allows investigators to produce embeddings that serve as vector summaries of the image or video content. Finally, infrastructural functions are also provided, such as tools for writing large-scale image and image sequence data as sequentialized byte strings for more rapid image analysis. causalimages therefore opens new capabilities for causal inference in R, letting researchers use informative imagery in substantive analyses in a fast and accessible manner.  ( 2 min )
    Conversational Financial Information Retrieval Model (ConFIRM). (arXiv:2310.13001v2 [cs.IR] UPDATED)
    With the exponential growth in large language models (LLMs), leveraging their emergent properties for specialized domains like finance merits exploration. However, regulated fields such as finance pose unique constraints, requiring domain-optimized frameworks. We present ConFIRM, an LLM-based conversational financial information retrieval model tailored for query intent classification and knowledge base labeling. ConFIRM comprises two modules: 1) a method to synthesize finance domain-specific question-answer pairs, and 2) evaluation of parameter efficient fine-tuning approaches for the query classification task. We generate a dataset of over 4000 samples, assessing accuracy on a separate test set. ConFIRM achieved over 90% accuracy, essential for regulatory compliance. ConFIRM provides a data-efficient solution to extract precise query intent for financial dialog systems.  ( 2 min )
    Statistical Guarantees for Variational Autoencoders using PAC-Bayesian Theory. (arXiv:2310.04935v2 [cs.LG] UPDATED)
    Since their inception, Variational Autoencoders (VAEs) have become central in machine learning. Despite their widespread use, numerous questions regarding their theoretical properties remain open. Using PAC-Bayesian theory, this work develops statistical guarantees for VAEs. First, we derive the first PAC-Bayesian bound for posterior distributions conditioned on individual samples from the data-generating distribution. Then, we utilize this result to develop generalization guarantees for the VAE's reconstruction loss, as well as upper bounds on the distance between the input and the regenerated distributions. More importantly, we provide upper bounds on the Wasserstein distance between the input distribution and the distribution defined by the VAE's generative model.  ( 2 min )
    Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity. (arXiv:2308.00177v2 [cs.LG] UPDATED)
    While deep learning (DL) models are state-of-the-art in text and image domains, they have not yet consistently outperformed Gradient Boosted Decision Trees (GBDTs) on tabular Learning-To-Rank (LTR) problems. Most of the recent performance gains attained by DL models in text and image tasks have used unsupervised pretraining, which exploits orders of magnitude more unlabeled data than labeled data. To the best of our knowledge, unsupervised pretraining has not been applied to the LTR problem, which often produces vast amounts of unlabeled data. In this work, we study whether unsupervised pretraining of deep models can improve LTR performance over GBDTs and other non-pretrained models. By incorporating simple design choices--including SimCLR-Rank, an LTR-specific pretraining loss--we produce pretrained deep learning models that consistently (across datasets) outperform GBDTs (and other non-pretrained rankers) in the case where there is more unlabeled data than labeled data. This performance improvement occurs not only on average but also on outlier queries. We base our empirical conclusions off of experiments on (1) public benchmark tabular LTR datasets, and (2) a large industry-scale proprietary ranking dataset. Code is provided at https://anonymous.4open.science/r/ltr-pretrain-0DAD/README.md.  ( 2 min )
    Dynamic Sparsity Is Channel-Level Sparsity Learner. (arXiv:2305.19454v2 [cs.LG] UPDATED)
    Sparse training has received an upsurging interest in machine learning due to its tantalizing saving potential for the entire training process as well as inference. Dynamic sparse training (DST), as a leading sparse training approach, can train deep neural networks at high sparsity from scratch to match the performance of their dense counterparts. However, most if not all DST prior arts demonstrate their effectiveness on unstructured sparsity with highly irregular sparse patterns, which receives limited support in common hardware. This limitation hinders the usage of DST in practice. In this paper, we propose Channel-aware dynamic sparse (Chase), which for the first time seamlessly translates the promise of unstructured dynamic sparsity to GPU-friendly channel-level sparsity (not fine-grained N:M or group sparsity) during one end-to-end training process, without any ad-hoc operations. The resulting small sparse networks can be directly accelerated by commodity hardware, without using any particularly sparsity-aware hardware accelerators. This appealing outcome is partially motivated by a hidden phenomenon of dynamic sparsity: off-the-shelf unstructured DST implicitly involves biased parameter reallocation across channels, with a large fraction of channels (up to 60%) being sparser than others. By progressively identifying and removing these channels during training, our approach translates unstructured sparsity to channel-wise sparsity. Our experimental results demonstrate that Chase achieves 1.7 X inference throughput speedup on common GPU devices without compromising accuracy with ResNet-50 on ImageNet. We release our codes in https://github.com/luuyin/chase.  ( 3 min )
    Neuro-Inspired Hierarchical Multimodal Learning. (arXiv:2309.15877v2 [cs.LG] UPDATED)
    Integrating and processing information from various sources or modalities are critical for obtaining a comprehensive and accurate perception of the real world. Drawing inspiration from neuroscience, we develop the Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the concept of information bottleneck. Distinct from most traditional fusion models that aim to incorporate all modalities as input, our model designates the prime modality as input, while the remaining modalities act as detectors in the information pathway. Our proposed perception model focuses on constructing an effective and compact information flow by achieving a balance between the minimization of mutual information between the latent state and the input modal state, and the maximization of mutual information between the latent states and the remaining modal states. This approach leads to compact latent state representations that retain relevant information while minimizing redundancy, thereby substantially enhancing the performance of downstream tasks. Experimental evaluations on both the MUStARD and CMU-MOSI datasets demonstrate that our model consistently distills crucial information in multimodal learning scenarios, outperforming state-of-the-art benchmarks.  ( 2 min )
    State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding. (arXiv:2309.12482v2 [cs.LG] UPDATED)
    As more non-AI experts use complex AI systems for daily tasks, there has been an increasing effort to develop methods that produce explanations of AI decision making that are understandable by non-AI experts. Towards this effort, leveraging higher-level concepts and producing concept-based explanations have become a popular method. Most concept-based explanations have been developed for classification techniques, and we posit that the few existing methods for sequential decision making are limited in scope. In this work, we first contribute a desiderata for defining concepts in sequential decision making settings. Additionally, inspired by the Protege Effect which states explaining knowledge often reinforces one's self-learning, we explore how concept-based explanations of an RL agent's decision making can in turn improve the agent's learning rate, as well as improve end-user understanding of the agent's decision making. To this end, we contribute a unified framework, State2Explanation (S2E), that involves learning a joint embedding model between state-action pairs and concept-based explanations, and leveraging such learned model to both (1) inform reward shaping during an agent's training, and (2) provide explanations to end-users at deployment for improved task performance. Our experimental validations, in Connect 4 and Lunar Lander, demonstrate the success of S2E in providing a dual-benefit, successfully informing reward shaping and improving agent learning rate, as well as significantly improving end user task performance at deployment time.  ( 3 min )
    Stabilizing Estimates of Shapley Values with Control Variates. (arXiv:2310.07672v2 [stat.ML] UPDATED)
    Shapley values are among the most popular tools for explaining predictions of blackbox machine learning models. However, their high computational cost motivates the use of sampling approximations, inducing a considerable degree of uncertainty. To stabilize these model explanations, we propose ControlSHAP, an approach based on the Monte Carlo technique of control variates. Our methodology is applicable to any machine learning model and requires virtually no extra computation or modeling effort. On several high-dimensional datasets, we find it can produce dramatic reductions in the Monte Carlo variability of Shapley estimates.  ( 2 min )
    Distribution-Informed Adaptation for kNN Graph Construction. (arXiv:2308.02442v3 [cs.LG] UPDATED)
    Graph-based kNN algorithms have garnered widespread popularity for machine learning tasks due to their simplicity and effectiveness. However, as factual data often inherit complex distributions, the conventional kNN graph's reliance on a unified k-value can hinder its performance. A crucial factor behind this challenge is the presence of ambiguous samples along decision boundaries that are inevitably more prone to incorrect classifications. To address the situation, we propose the Distribution-Informed adaptive kNN Graph (DaNNG), which combines adaptive kNN with distribution-aware graph construction. By incorporating an approximation of the distribution with customized k-adaption criteria, DaNNG can significantly improve performance on ambiguous samples, and hence enhance overall accuracy and generalization capability. Through rigorous evaluations on diverse benchmark datasets, DaNNG outperforms state-of-the-art algorithms, showcasing its adaptability and efficacy across various real-world scenarios.  ( 2 min )
    Aggregation Weighting of Federated Learning via Generalization Bound Estimation. (arXiv:2311.05936v1 [cs.LG])
    Federated Learning (FL) typically aggregates client model parameters using a weighting approach determined by sample proportions. However, this naive weighting method may lead to unfairness and degradation in model performance due to statistical heterogeneity and the inclusion of noisy data among clients. Theoretically, distributional robustness analysis has shown that the generalization performance of a learning model with respect to any shifted distribution is bounded. This motivates us to reconsider the weighting approach in federated learning. In this paper, we replace the aforementioned weighting method with a new strategy that considers the generalization bounds of each local model. Specifically, we estimate the upper and lower bounds of the second-order origin moment of the shifted distribution for the current local model, and then use these bounds disagreements as the aggregation proportions for weightings in each communication round. Experiments demonstrate that the proposed weighting strategy significantly improves the performance of several representative FL algorithms on benchmark datasets.  ( 2 min )
    Beyond expectations: Residual Dynamic Mode Decomposition and Variance for Stochastic Dynamical Systems. (arXiv:2308.10697v3 [math.DS] UPDATED)
    Koopman operators linearize nonlinear dynamical systems, making their spectral information of crucial interest. Numerous algorithms have been developed to approximate these spectral properties, and Dynamic Mode Decomposition (DMD) stands out as the poster child of projection-based methods. Although the Koopman operator itself is linear, the fact that it acts in an infinite-dimensional space of observables poses challenges. These include spurious modes, essential spectra, and the verification of Koopman mode decompositions. While recent work has addressed these challenges for deterministic systems, there remains a notable gap in verified DMD methods for stochastic systems, where the Koopman operator measures the expectation of observables. We show that it is necessary to go beyond expectations to address these issues. By incorporating variance into the Koopman framework, we address these challenges. Through an additional DMD-type matrix, we approximate the sum of a squared residual and a variance term, each of which can be approximated individually using batched snapshot data. This allows verified computation of the spectral properties of stochastic Koopman operators, controlling the projection error. We also introduce the concept of variance-pseudospectra to gauge statistical coherency. Finally, we present a suite of convergence results for the spectral information of stochastic Koopman operators. Our study concludes with practical applications using both simulated and experimental data. In neural recordings from awake mice, we demonstrate how variance-pseudospectra can reveal physiologically significant information unavailable to standard expectation-based dynamical models.  ( 3 min )
    Robust Adversarial Attacks Detection for Deep Learning based Relative Pose Estimation for Space Rendezvous. (arXiv:2311.05992v1 [cs.CV])
    Research on developing deep learning techniques for autonomous spacecraft relative navigation challenges is continuously growing in recent years. Adopting those techniques offers enhanced performance. However, such approaches also introduce heightened apprehensions regarding the trustability and security of such deep learning methods through their susceptibility to adversarial attacks. In this work, we propose a novel approach for adversarial attack detection for deep neural network-based relative pose estimation schemes based on the explainability concept. We develop for an orbital rendezvous scenario an innovative relative pose estimation technique adopting our proposed Convolutional Neural Network (CNN), which takes an image from the chaser's onboard camera and outputs accurately the target's relative position and rotation. We perturb seamlessly the input images using adversarial attacks that are generated by the Fast Gradient Sign Method (FGSM). The adversarial attack detector is then built based on a Long Short Term Memory (LSTM) network which takes the explainability measure namely SHapley Value from the CNN-based pose estimator and flags the detection of adversarial attacks when acting. Simulation results show that the proposed adversarial attack detector achieves a detection accuracy of 99.21%. Both the deep relative pose estimator and adversarial attack detector are then tested on real data captured from our laboratory-designed setup. The experimental results from our laboratory-designed setup demonstrate that the proposed adversarial attack detector achieves an average detection accuracy of 96.29%.  ( 3 min )
    StrainTensorNet: Predicting crystal structure elastic properties using SE(3)-equivariant graph neural networks. (arXiv:2306.12818v2 [cond-mat.dis-nn] UPDATED)
    Accurately predicting the elastic properties of crystalline solids is vital for computational materials science. However, traditional atomistic scale ab initio approaches are computationally intensive, especially for studying complex materials with a large number of atoms in a unit cell. We introduce a novel data-driven approach to efficiently predict the elastic properties of crystal structures using SE(3)-equivariant graph neural networks (GNNs). This approach yields important scalar elastic moduli with the accuracy comparable to recent data-driven studies. Importantly, our symmetry-aware GNNs model also enables the prediction of the strain energy density (SED) and the associated elastic constants, the fundamental tensorial quantities that are significantly influenced by a material's crystallographic group. The model consistently distinguishes independent elements of SED tensors, in accordance with the symmetry of the crystal structures. Finally, our deep learning model possesses meaningful latent features, offering an interpretable prediction of the elastic properties.  ( 2 min )
    Deep quantum neural networks form Gaussian processes. (arXiv:2305.09957v2 [quant-ph] UPDATED)
    It is well known that artificial neural networks initialized from independent and identically distributed priors converge to Gaussian processes in the limit of large number of neurons per hidden layer. In this work we prove an analogous result for Quantum Neural Networks (QNNs). Namely, we show that the outputs of certain models based on Haar random unitary or orthogonal deep QNNs converge to Gaussian processes in the limit of large Hilbert space dimension $d$. The derivation of this result is more nuanced than in the classical case due to the role played by the input states, the measurement observable, and the fact that the entries of unitary matrices are not independent. An important consequence of our analysis is that the ensuing Gaussian processes cannot be used to efficiently predict the outputs of the QNN via Bayesian statistics. Furthermore, our theorems imply that the concentration of measure phenomenon in Haar random QNNs is worse than previously thought, as we prove that expectation values and gradients concentrate as $\mathcal{O}\left(\frac{1}{e^d \sqrt{d}}\right)$. Finally, we discuss how our results improve our understanding of concentration in $t$-designs.  ( 2 min )
    Adapting Fairness Interventions to Missing Values. (arXiv:2305.19429v2 [cs.LG] UPDATED)
    Missing values in real-world data pose a significant and unique challenge to algorithmic fairness. Different demographic groups may be unequally affected by missing data, and the standard procedure for handling missing values where first data is imputed, then the imputed data is used for classification -- a procedure referred to as "impute-then-classify" -- can exacerbate discrimination. In this paper, we analyze how missing values affect algorithmic fairness. We first prove that training a classifier from imputed data can significantly worsen the achievable values of group fairness and average accuracy. This is because imputing data results in the loss of the missing pattern of the data, which often conveys information about the predictive label. We present scalable and adaptive algorithms for fair classification with missing values. These algorithms can be combined with any preexisting fairness-intervention algorithm to handle all possible missing patterns while preserving information encoded within the missing patterns. Numerical experiments with state-of-the-art fairness interventions demonstrate that our adaptive algorithms consistently achieve higher fairness and accuracy than impute-then-classify across different datasets.  ( 2 min )
    Hiformer: Heterogeneous Feature Interactions Learning with Transformers for Recommender Systems. (arXiv:2311.05884v1 [cs.IR])
    Learning feature interaction is the critical backbone to building recommender systems. In web-scale applications, learning feature interaction is extremely challenging due to the sparse and large input feature space; meanwhile, manually crafting effective feature interactions is infeasible because of the exponential solution space. We propose to leverage a Transformer-based architecture with attention layers to automatically capture feature interactions. Transformer architectures have witnessed great success in many domains, such as natural language processing and computer vision. However, there has not been much adoption of Transformer architecture for feature interaction modeling in industry. We aim at closing the gap. We identify two key challenges for applying the vanilla Transformer architecture to web-scale recommender systems: (1) Transformer architecture fails to capture the heterogeneous feature interactions in the self-attention layer; (2) The serving latency of Transformer architecture might be too high to be deployed in web-scale recommender systems. We first propose a heterogeneous self-attention layer, which is a simple yet effective modification to the self-attention layer in Transformer, to take into account the heterogeneity of feature interactions. We then introduce \textsc{Hiformer} (\textbf{H}eterogeneous \textbf{I}nteraction Trans\textbf{former}) to further improve the model expressiveness. With low-rank approximation and model pruning, \hiformer enjoys fast inference for online deployment. Extensive offline experiment results corroborates the effectiveness and efficiency of the \textsc{Hiformer} model. We have successfully deployed the \textsc{Hiformer} model to a real world large scale App ranking model at Google Play, with significant improvement in key engagement metrics (up to +2.66\%).  ( 3 min )
    Tamil-Llama: A New Tamil Language Model Based on Llama 2. (arXiv:2311.05845v1 [cs.CL])
    Language modeling has witnessed remarkable advancements in recent years, with Large Language Models (LLMs) like ChatGPT setting unparalleled benchmarks in human-like text generation. However, a prevailing limitation is the underrepresentation of languages like Tamil in these cutting-edge models, leading to suboptimal performance in diverse linguistic contexts. This paper addresses this lacuna, enhancing the open-source LLaMA model with an addition of 16,000 Tamil tokens, aiming to achieve superior text generation and comprehension in the Tamil language. We strategically employ the LoRA methodology for efficient model training on a comprehensive Tamil corpus, ensuring computational feasibility and model robustness. Moreover, we introduce a Tamil-translated version of the Alpaca dataset and a subset of the OpenOrca dataset tailored for instruction fine-tuning. Our results showcase significant performance improvements in Tamil text generation, with potential implications for the broader landscape of LLMs in Indian languages. We further underscore our commitment to open research by making our models, datasets, and code publicly accessible, fostering further innovations in language modeling.  ( 2 min )
    Layer-wise Auto-Weighting for Non-Stationary Test-Time Adaptation. (arXiv:2311.05858v1 [cs.LG])
    Given the inevitability of domain shifts during inference in real-world applications, test-time adaptation (TTA) is essential for model adaptation after deployment. However, the real-world scenario of continuously changing target distributions presents challenges including catastrophic forgetting and error accumulation. Existing TTA methods for non-stationary domain shifts, while effective, incur excessive computational load, making them impractical for on-device settings. In this paper, we introduce a layer-wise auto-weighting algorithm for continual and gradual TTA that autonomously identifies layers for preservation or concentrated adaptation. By leveraging the Fisher Information Matrix (FIM), we first design the learning weight to selectively focus on layers associated with log-likelihood changes while preserving unrelated ones. Then, we further propose an exponential min-max scaler to make certain layers nearly frozen while mitigating outliers. This minimizes forgetting and error accumulation, leading to efficient adaptation to non-stationary target distribution. Experiments on CIFAR-10C, CIFAR-100C, and ImageNet-C show our method outperforms conventional continual and gradual TTA approaches while significantly reducing computational load, highlighting the importance of FIM-based learning weight in adapting to continuously or gradually shifting target domains.  ( 2 min )
    Synthesize High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model. (arXiv:2304.02169v3 [cs.LG] UPDATED)
    Synthetic electronic health records (EHRs) that are both realistic and preserve privacy can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. However, generating high-fidelity and granular electronic health record (EHR) data in its original, highly-dimensional form poses challenges for existing methods due to the complexities inherent in high-dimensional data. In this paper, we propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal high-dimensional EHR, which preserve the statistical properties of real EHR and can be used to train accurate ML models without privacy concerns. Our HALO method, designed as a hierarchical autoregressive model, generates a probability density function of medical codes, clinical visits, and patient records, allowing for the generation of realistic EHR data in its original, unaggregated form without the need for variable selection or aggregation. Additionally, our model also produces high-quality continuous variables in a longitudinal and probabilistic manner. We conducted extensive experiments and demonstrate that HALO can generate high-fidelity EHR data with high-dimensional disease code probabilities (d > 10,000), disease co-occurrence probabilities within visits (d > 1,000,000), and conditional probabilities across consecutive visits (d > 5,000,000) and achieve above 0.9 R2 correlation in comparison to real EHR data. This performance then enables downstream ML models trained on its synthetic data to achieve comparable accuracy to models trained on real data (0.938 AUROC with HALO data vs. 0.943 with real data). Finally, using a combination of real and synthetic data enhances the accuracy of ML models beyond that achieved by using only real EHR data.  ( 3 min )
    Hierarchical deep learning-based adaptive time-stepping scheme for multiscale simulations. (arXiv:2311.05961v1 [math.DS])
    Multiscale is a hallmark feature of complex nonlinear systems. While the simulation using the classical numerical methods is restricted by the local \textit{Taylor} series constraints, the multiscale techniques are often limited by finding heuristic closures. This study proposes a new method for simulating multiscale problems using deep neural networks. By leveraging the hierarchical learning of neural network time steppers, the method adapts time steps to approximate dynamical system flow maps across timescales. This approach achieves state-of-the-art performance in less computational time compared to fixed-step neural network solvers. The proposed method is demonstrated on several nonlinear dynamical systems, and source codes are provided for implementation. This method has the potential to benefit multiscale analysis of complex systems and encourage further investigation in this area.  ( 2 min )
    Is a Seat at the Table Enough? Engaging Teachers and Students in Dataset Specification for ML in Education. (arXiv:2311.05792v1 [cs.CY])
    Despite the promises of ML in education, its adoption in the classroom has surfaced numerous issues regarding fairness, accountability, and transparency, as well as concerns about data privacy and student consent. A root cause of these issues is the lack of understanding of the complex dynamics of education, including teacher-student interactions, collaborative learning, and classroom environment. To overcome these challenges and fully utilize the potential of ML in education, software practitioners need to work closely with educators and students to fully understand the context of the data (the backbone of ML applications) and collaboratively define the ML data specifications. To gain a deeper understanding of such a collaborative process, we conduct ten co-design sessions with ML software practitioners, educators, and students. In the sessions, teachers and students work with ML engineers, UX designers, and legal practitioners to define dataset characteristics for a given ML application. We find that stakeholders contextualize data based on their domain and procedural knowledge, proactively design data requirements to mitigate downstream harms and data reliability concerns, and exhibit role-based collaborative strategies and contribution patterns. Further, we find that beyond a seat at the table, meaningful stakeholder participation in ML requires structured supports: defined processes for continuous iteration and co-evaluation, shared contextual data quality standards, and information scaffolds for both technical and non-technical stakeholders to traverse expertise boundaries.  ( 3 min )
    Structured Transforms Across Spaces with Cost-Regularized Optimal Transport. (arXiv:2311.05788v1 [cs.LG])
    Matching a source to a target probability measure is often solved by instantiating a linear optimal transport (OT) problem, parameterized by a ground cost function that quantifies discrepancy between points. When these measures live in the same metric space, the ground cost often defaults to its distance. When instantiated across two different spaces, however, choosing that cost in the absence of aligned data is a conundrum. As a result, practitioners often resort to solving instead a quadratic Gromow-Wasserstein (GW) problem. We exploit in this work a parallel between GW and cost-regularized OT, the regularized minimization of a linear OT objective parameterized by a ground cost. We use this cost-regularized formulation to match measures across two different Euclidean spaces, where the cost is evaluated between transformed source points and target points. We show that several quadratic OT problems fall in this category, and consider enforcing structure in linear transform (e.g. sparsity), by introducing structure-inducing regularizers. We provide a proximal algorithm to extract such transforms from unaligned data, and demonstrate its applicability to single-cell spatial transcriptomics/multiomics matching tasks.  ( 2 min )
    Testing Dependency of Unlabeled Databases. (arXiv:2311.05874v1 [cs.LG])
    In this paper, we investigate the problem of deciding whether two random databases $\mathsf{X}\in\mathcal{X}^{n\times d}$ and $\mathsf{Y}\in\mathcal{Y}^{n\times d}$ are statistically dependent or not. This is formulated as a hypothesis testing problem, where under the null hypothesis, these two databases are statistically independent, while under the alternative, there exists an unknown row permutation $\sigma$, such that $\mathsf{X}$ and $\mathsf{Y}^\sigma$, a permuted version of $\mathsf{Y}$, are statistically dependent with some known joint distribution, but have the same marginal distributions as the null. We characterize the thresholds at which optimal testing is information-theoretically impossible and possible, as a function of $n$, $d$, and some spectral properties of the generative distributions of the datasets. For example, we prove that if a certain function of the eigenvalues of the likelihood function and $d$, is below a certain threshold, as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, no matter what the value of $n$ is. This mimics the performance of an efficient test that thresholds a centered version of the log-likelihood function of the observed matrices. We also analyze the case where $d$ is fixed, for which we derive strong (vanishing error) and weak detection lower and upper bounds.  ( 2 min )
    Scale-MIA: A Scalable Model Inversion Attack against Secure Federated Learning via Latent Space Reconstruction. (arXiv:2311.05808v1 [cs.LG])
    Federated learning is known for its capability to safeguard participants' data privacy. However, recently emerged model inversion attacks (MIAs) have shown that a malicious parameter server can reconstruct individual users' local data samples through model updates. The state-of-the-art attacks either rely on computation-intensive search-based optimization processes to recover each input batch, making scaling difficult, or they involve the malicious parameter server adding extra modules before the global model architecture, rendering the attacks too conspicuous and easily detectable. To overcome these limitations, we propose Scale-MIA, a novel MIA capable of efficiently and accurately recovering training samples of clients from the aggregated updates, even when the system is under the protection of a robust secure aggregation protocol. Unlike existing approaches treating models as black boxes, Scale-MIA recognizes the importance of the intricate architecture and inner workings of machine learning models. It identifies the latent space as the critical layer for breaching privacy and decomposes the complex recovery task into an innovative two-step process to reduce computation complexity. The first step involves reconstructing the latent space representations (LSRs) from the aggregated model updates using a closed-form inversion mechanism, leveraging specially crafted adversarial linear layers. In the second step, the whole input batches are recovered from the LSRs by feeding them into a fine-tuned generative decoder. We implemented Scale-MIA on multiple commonly used machine learning models and conducted comprehensive experiments across various settings. The results demonstrate that Scale-MIA achieves excellent recovery performance on different datasets, exhibiting high reconstruction rates, accuracy, and attack efficiency on a larger scale compared to state-of-the-art MIAs.  ( 3 min )
    Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning. (arXiv:2305.13971v5 [cs.CL] UPDATED)
    Despite their impressive performance, large language models (LMs) still struggle with reliably generating complex output structures when not finetuned to follow the required output format exactly. To address this issue, grammar-constrained decoding (GCD) can be used to control the generation of LMs, guaranteeing that the output follows a given structure. Most existing GCD methods are, however, limited to specific tasks, such as parsing or code generation. In this work, we demonstrate that formal grammars can describe the output space for a much wider range of tasks and argue that GCD can serve as a unified framework for structured NLP tasks in general. For increased flexibility, we introduce input-dependent grammars, which allow the grammar to depend on the input and thus enable the generation of different output structures for different inputs. We then empirically demonstrate the power and flexibility of GCD-enhanced LMs on (1) information extraction, (2) entity disambiguation, and (3) constituency parsing. Our results indicate that grammar-constrained LMs substantially outperform unconstrained LMs or even beat task-specific finetuned models. Grammar constraints thus hold great promise for harnessing off-the-shelf LMs for a wide range of structured NLP tasks, especially where training data is scarce or finetuning is expensive. Code and data: https://github.com/epfl-dlab/GCD.  ( 3 min )
    Adaptive Variance Thresholding: A Novel Approach to Improve Existing Deep Transfer Vision Models and Advance Automatic Knee-Joint Osteoarthritis Classification. (arXiv:2311.05799v1 [eess.IV])
    Knee-Joint Osteoarthritis (KOA) is a prevalent cause of global disability and is inherently complex to diagnose due to its subtle radiographic markers and individualized progression. One promising classification avenue involves applying deep learning methods; however, these techniques demand extensive, diversified datasets, which pose substantial challenges due to medical data collection restrictions. Existing practices typically resort to smaller datasets and transfer learning. However, this approach often inherits unnecessary pre-learned features that can clutter the classifier's vector space, potentially hampering performance. This study proposes a novel paradigm for improving post-training specialized classifiers by introducing adaptive variance thresholding (AVT) followed by Neural Architecture Search (NAS). This approach led to two key outcomes: an increase in the initial accuracy of the pre-trained KOA models and a 60-fold reduction in the NAS input vector space, thus facilitating faster inference speed and a more efficient hyperparameter search. We also applied this approach to an external model trained for KOA classification. Despite its initial performance, the application of our methodology improved its average accuracy, making it one of the top three KOA classification models.  ( 2 min )
    Improved Positional Encoding for Implicit Neural Representation based Compact Data Representation. (arXiv:2311.06059v1 [cs.CV])
    Positional encodings are employed to capture the high frequency information of the encoded signals in implicit neural representation (INR). In this paper, we propose a novel positional encoding method which improves the reconstruction quality of the INR. The proposed embedding method is more advantageous for the compact data representation because it has a greater number of frequency basis than the existing methods. Our experiments shows that the proposed method achieves significant gain in the rate-distortion performance without introducing any additional complexity in the compression task and higher reconstruction quality in novel view synthesis.  ( 2 min )
    AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training. (arXiv:2311.05827v1 [cs.LG])
    It is usually infeasible to fit and train an entire large deep neural network (DNN) model using a single edge device due to the limited resources. To facilitate intelligent applications across edge devices, researchers have proposed partitioning a large model into several sub-models, and deploying each of them to a different edge device to collaboratively train a DNN model. However, the communication overhead caused by the large amount of data transmitted from one device to another during training, as well as the sub-optimal partition point due to the inaccurate latency prediction of computation at each edge device can significantly slow down training. In this paper, we propose AccEPT, an acceleration scheme for accelerating the edge collaborative pipeline-parallel training. In particular, we propose a light-weight adaptive latency predictor to accurately estimate the computation latency of each layer at different devices, which also adapts to unseen devices through continuous learning. Therefore, the proposed latency predictor leads to better model partitioning which balances the computation loads across participating devices. Moreover, we propose a bit-level computation-efficient data compression scheme to compress the data to be transmitted between devices during training. Our numerical results demonstrate that our proposed acceleration approach is able to significantly speed up edge pipeline parallel training up to 3 times faster in the considered experimental settings.  ( 2 min )
    Conceptual structure coheres in human cognition but not in large language models. (arXiv:2304.02754v2 [cs.AI] UPDATED)
    Neural network models of language have long been used as a tool for developing hypotheses about conceptual representation in the mind and brain. For many years, such use involved extracting vector-space representations of words and using distances among these to predict or understand human behavior in various semantic tasks. Contemporary large language models (LLMs), however, make it possible to interrogate the latent structure of conceptual representations using experimental methods nearly identical to those commonly used with human participants. The current work utilizes three common techniques borrowed from cognitive psychology to estimate and compare the structure of concepts in humans and a suite of LLMs. In humans, we show that conceptual structure is robust to differences in culture, language, and method of estimation. Structures estimated from LLM behavior, while individually fairly consistent with those estimated from human behavior, vary much more depending upon the particular task used to generate responses--across tasks, estimates of conceptual structure from the very same model cohere less with one another than do human structure estimates. These results highlight an important difference between contemporary LLMs and human cognition, with implications for understanding some fundamental limitations of contemporary machine language.  ( 2 min )
    Synthesizing Bidirectional Temporal States of Knee Osteoarthritis Radiographs with Cycle-Consistent Generative Adversarial Neural Networks. (arXiv:2311.05798v1 [eess.IV])
    Knee Osteoarthritis (KOA), a leading cause of disability worldwide, is challenging to detect early due to subtle radiographic indicators. Diverse, extensive datasets are needed but are challenging to compile because of privacy, data collection limitations, and the progressive nature of KOA. However, a model capable of projecting genuine radiographs into different OA stages could augment data pools, enhance algorithm training, and offer pre-emptive prognostic insights. In this study, we trained a CycleGAN model to synthesize past and future stages of KOA on any genuine radiograph. The model was validated using a Convolutional Neural Network that was deceived into misclassifying disease stages in transformed images, demonstrating the CycleGAN's ability to effectively transform disease characteristics forward or backward in time. The model was particularly effective in synthesizing future disease states and showed an exceptional ability to retroactively transition late-stage radiographs to earlier stages by eliminating osteophytes and expanding knee joint space, signature characteristics of None or Doubtful KOA. The model's results signify a promising potential for enhancing diagnostic models, data augmentation, and educational and prognostic usage in healthcare. Nevertheless, further refinement, validation, and a broader evaluation process encompassing both CNN-based assessments and expert medical feedback are emphasized for future research and development.  ( 2 min )
    Clipped-Objective Policy Gradients for Pessimistic Policy Optimization. (arXiv:2311.05846v1 [cs.LG])
    To facilitate efficient learning, policy gradient approaches to deep reinforcement learning (RL) are typically paired with variance reduction measures and strategies for making large but safe policy changes based on a batch of experiences. Natural policy gradient methods, including Trust Region Policy Optimization (TRPO), seek to produce monotonic improvement through bounded changes in policy outputs. Proximal Policy Optimization (PPO) is a commonly used, first-order algorithm that instead uses loss clipping to take multiple safe optimization steps per batch of data, replacing the bound on the single step of TRPO with regularization on multiple steps. In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective. Instead of the importance sampling objective of PPO, we instead recommend a basic policy gradient, clipped in an equivalent fashion. While both objectives produce biased gradient estimates with respect to the RL objective, they also both display significantly reduced variance compared to the unbiased off-policy policy gradient. Additionally, we show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to both the PPO objective and (2) this pessimism promotes enhanced exploration. As a result, we empirically observe that COPG produces improved learning compared to PPO in single-task, constrained, and multi-task learning, without adding significant computational cost or complexity. Compared to TRPO, the COPG approach is seen to offer comparable or superior performance, while retaining the simplicity of a first-order method.  ( 2 min )
    Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization. (arXiv:2311.06243v1 [cs.LG])
    Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.  ( 2 min )
    Ulcerative Colitis Mayo Endoscopic Scoring Classification with Active Learning and Generative Data Augmentation. (arXiv:2311.06057v1 [eess.IV])
    Endoscopic imaging is commonly used to diagnose Ulcerative Colitis (UC) and classify its severity. It has been shown that deep learning based methods are effective in automated analysis of these images and can potentially be used to aid medical doctors. Unleashing the full potential of these methods depends on the availability of large amount of labeled images; however, obtaining and labeling these images are quite challenging. In this paper, we propose a active learning based generative augmentation method. The method involves generating a large number of synthetic samples by training using a small dataset consisting of real endoscopic images. The resulting data pool is narrowed down by using active learning methods to select the most informative samples, which are then used to train a classifier. We demonstrate the effectiveness of our method through experiments on a publicly available endoscopic image dataset. The results show that using synthesized samples in conjunction with active learning leads to improved classification performance compared to using only the original labeled examples and the baseline classification performance of 68.1% increases to 74.5% in terms of Quadratic Weighted Kappa (QWK) Score. Another observation is that, attaining equivalent performance using only real data necessitated three times higher number of images.  ( 3 min )
    Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration. (arXiv:2311.06062v1 [cs.CL])
    Membership Inference Attacks (MIA) aim to infer whether a target data record has been utilized for model training or not. Prior attempts have quantified the privacy risks of language models (LMs) via MIAs, but there is still no consensus on whether existing MIA algorithms can cause remarkable privacy leakage on practical Large Language Models (LLMs). Existing MIAs designed for LMs can be classified into two categories: reference-free and reference-based attacks. They are both based on the hypothesis that training records consistently strike a higher probability of being sampled. Nevertheless, this hypothesis heavily relies on the overfitting of target models, which will be mitigated by multiple regularization methods and the generalization of LLMs. The reference-based attack seems to achieve promising effectiveness in LLMs, which measures a more reliable membership signal by comparing the probability discrepancy between the target model and the reference model. However, the performance of reference-based attack is highly dependent on a reference dataset that closely resembles the training dataset, which is usually inaccessible in the practical scenario. Overall, existing MIAs are unable to effectively unveil privacy leakage over practical fine-tuned LLMs that are overfitting-free and private. We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA). Specifically, since memorization in LLMs is inevitable during the training process and occurs before overfitting, we introduce a more reliable membership signal, probabilistic variation, which is based on memorization rather than overfitting. Furthermore, we introduce a self-prompt approach, which constructs the dataset to fine-tune the reference model by prompting the target LLM itself. In this manner, the adversary can collect a dataset with a similar distribution from public APIs.  ( 3 min )
    Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models. (arXiv:2311.06233v1 [cs.CL])
    We propose the Data Contamination Quiz, a simple and effective approach to detect data contamination in large language models (LLMs) and estimate the amount of it. Specifically, we frame data contamination detection as a series of multiple-choice questions. We devise a quiz format wherein three perturbed versions of each dataset instance are created. These changes only include word-level perturbations, replacing words with their contextual synonyms, ensuring both the semantic and sentence structure remain exactly the same as the original instance. Together with the original instance, these perturbed versions constitute the choices in the quiz. Given that the only distinguishing signal among these choices is the exact wording, an LLM, when tasked with identifying the original instance from the choices, opts for the original if it has memorized it in its pre-training phase--a trait intrinsic to LLMs. A dataset partition is then marked as contaminated if the LLM's performance on the quiz surpasses what random chance suggests. Our evaluation spans seven datasets and their respective splits (train and test/validation) on two state-of-the-art LLMs: GPT-4 and GPT-3.5. While lacking access to the pre-training data, our results suggest that our approach not only enhances the detection of data contamination but also provides an accurate estimation of its extent, even when the contamination signal is weak.  ( 2 min )
    An alternative for one-hot encoding in neural network models. (arXiv:2311.05911v1 [cs.LG])
    This paper proposes an algorithm that implements binary encoding of the categorical features of neural network model input data, while also implementing changes in the forward and backpropagation procedures in order to achieve the property of having model weight changes, that result from the neural network learning process for certain data instances of some feature category, only affect the forward pass calculations for input data instances of that same feature category, as it is in the case of utilising one-hot encoding for categorical features.  ( 2 min )
    Chanakya: Learning Runtime Decisions for Adaptive Real-Time Perception. (arXiv:2106.05665v3 [cs.CV] UPDATED)
    Real-time perception requires planned resource utilization. Computational planning in real-time perception is governed by two considerations -- accuracy and latency. There exist run-time decisions (e.g. choice of input resolution) that induce tradeoffs affecting performance on a given hardware, arising from intrinsic (content, e.g. scene clutter) and extrinsic (system, e.g. resource contention) characteristics. Earlier runtime execution frameworks employed rule-based decision algorithms and operated with a fixed algorithm latency budget to balance these concerns, which is sub-optimal and inflexible. We propose Chanakya, a learned approximate execution framework that naturally derives from the streaming perception paradigm, to automatically learn decisions induced by these tradeoffs instead. Chanakya is trained via novel rewards balancing accuracy and latency implicitly, without approximating either objectives. Chanakya simultaneously considers intrinsic and extrinsic context, and predicts decisions in a flexible manner. Chanakya, designed with low overhead in mind, outperforms state-of-the-art static and dynamic execution policies on public datasets on both server GPUs and edge devices.  ( 2 min )
    The Paradox of Noise: An Empirical Study of Noise-Infusion Mechanisms to Improve Generalization, Stability, and Privacy in Federated Learning. (arXiv:2311.05790v1 [cs.LG])
    In a data-centric era, concerns regarding privacy and ethical data handling grow as machine learning relies more on personal information. This empirical study investigates the privacy, generalization, and stability of deep learning models in the presence of additive noise in federated learning frameworks. Our main objective is to provide strategies to measure the generalization, stability, and privacy-preserving capabilities of these models and further improve them. To this end, five noise infusion mechanisms at varying noise levels within centralized and federated learning settings are explored. As model complexity is a key component of the generalization and stability of deep learning models during training and evaluation, a comparative analysis of three Convolutional Neural Network (CNN) architectures is provided. The paper introduces Signal-to-Noise Ratio (SNR) as a quantitative measure of the trade-off between privacy and training accuracy of noise-infused models, aiming to find the noise level that yields optimal privacy and accuracy. Moreover, the Price of Stability and Price of Anarchy are defined in the context of privacy-preserving deep learning, contributing to the systematic investigation of the noise infusion strategies to enhance privacy without compromising performance. Our research sheds light on the delicate balance between these critical factors, fostering a deeper understanding of the implications of noise-based regularization in machine learning. By leveraging noise as a tool for regularization and privacy enhancement, we aim to contribute to the development of robust, privacy-aware algorithms, ensuring that AI-driven solutions prioritize both utility and privacy.  ( 3 min )
    Towards stable real-world equation discovery with assessing differentiating quality influence. (arXiv:2311.05787v1 [cs.LG])
    This paper explores the critical role of differentiation approaches for data-driven differential equation discovery. Accurate derivatives of the input data are essential for reliable algorithmic operation, particularly in real-world scenarios where measurement quality is inevitably compromised. We propose alternatives to the commonly used finite differences-based method, notorious for its instability in the presence of noise, which can exacerbate random errors in the data. Our analysis covers four distinct methods: Savitzky-Golay filtering, spectral differentiation, smoothing based on artificial neural networks, and the regularization of derivative variation. We evaluate these methods in terms of applicability to problems, similar to the real ones, and their ability to ensure the convergence of equation discovery algorithms, providing valuable insights for robust modeling of real-world processes.  ( 2 min )
    OmniVec: Learning robust representations with cross modal sharing. (arXiv:2311.05709v1 [cs.CV])
    Majority of research in learning based methods has been towards designing and training networks for specific tasks. However, many of the learning based tasks, across modalities, share commonalities and could be potentially tackled in a joint framework. We present an approach in such direction, to learn multiple tasks, in multiple modalities, with a unified architecture. The proposed network is composed of task specific encoders, a common trunk in the middle, followed by task specific prediction heads. We first pre-train it by self-supervised masked training, followed by sequential training for the different tasks. We train the network on all major modalities, e.g.\ visual, audio, text and 3D, and report results on $22$ diverse and challenging public benchmarks. We demonstrate empirically that, using a joint network to train across modalities leads to meaningful information sharing and this allows us to achieve state-of-the-art results on most of the benchmarks. We also show generalization of the trained network on cross-modal tasks as well as unseen datasets and tasks.  ( 2 min )
    Argumentation Element Annotation Modeling using XLNet. (arXiv:2311.06239v1 [cs.CL])
    This study demonstrates the effectiveness of XLNet, a transformer-based language model, for annotating argumentative elements in persuasive essays. XLNet's architecture incorporates a recurrent mechanism that allows it to model long-term dependencies in lengthy texts. Fine-tuned XLNet models were applied to three datasets annotated with different schemes - a proprietary dataset using the Annotations for Revisions and Reflections on Writing (ARROW) scheme, the PERSUADE corpus, and the Argument Annotated Essays (AAE) dataset. The XLNet models achieved strong performance across all datasets, even surpassing human agreement levels in some cases. This shows XLNet capably handles diverse annotation schemes and lengthy essays. Comparisons between the model outputs on different datasets also revealed insights into the relationships between the annotation tags. Overall, XLNet's strong performance on modeling argumentative structures across diverse datasets highlights its suitability for providing automated feedback on essay organization.  ( 2 min )
    Frequency-domain MLPs are More Effective Learners in Time Series Forecasting. (arXiv:2311.06184v1 [cs.LG])
    Time series forecasting has played the key role in different industrial, including finance, traffic, energy, and healthcare domains. While existing literatures have designed many sophisticated architectures based on RNNs, GNNs, or Transformers, another kind of approaches based on multi-layer perceptrons (MLPs) are proposed with simple structure, low complexity, and {superior performance}. However, most MLP-based forecasting methods suffer from the point-wise mappings and information bottleneck, which largely hinders the forecasting performance. To overcome this problem, we explore a novel direction of applying MLPs in the frequency domain for time series forecasting. We investigate the learned patterns of frequency-domain MLPs and discover their two inherent characteristic benefiting forecasting, (i) global view: frequency spectrum makes MLPs own a complete view for signals and learn global dependencies more easily, and (ii) energy compaction: frequency-domain MLPs concentrate on smaller key part of frequency components with compact signal energy. Then, we propose FreTS, a simple yet effective architecture built upon Frequency-domain MLPs for Time Series forecasting. FreTS mainly involves two stages, (i) Domain Conversion, that transforms time-domain signals into complex numbers of frequency domain; (ii) Frequency Learning, that performs our redesigned MLPs for the learning of real and imaginary part of frequency components. The above stages operated on both inter-series and intra-series scales further contribute to channel-wise and time-wise dependency learning. Extensive experiments on 13 real-world benchmarks (including 7 benchmarks for short-term forecasting and 6 benchmarks for long-term forecasting) demonstrate our consistent superiority over state-of-the-art methods.  ( 3 min )
    Plasma Surrogate Modelling using Fourier Neural Operators. (arXiv:2311.05967v1 [physics.plasm-ph])
    Predicting plasma evolution within a Tokamak reactor is crucial to realizing the goal of sustainable fusion. Capabilities in forecasting the spatio-temporal evolution of plasma rapidly and accurately allow us to quickly iterate over design and control strategies on current Tokamak devices and future reactors. Modelling plasma evolution using numerical solvers is often expensive, consuming many hours on supercomputers, and hence, we need alternative inexpensive surrogate models. We demonstrate accurate predictions of plasma evolution both in simulation and experimental domains using deep learning-based surrogate modelling tools, viz., Fourier Neural Operators (FNO). We show that FNO has a speedup of six orders of magnitude over traditional solvers in predicting the plasma dynamics simulated from magnetohydrodynamic models, while maintaining a high accuracy (MSE $\approx$ $10^{-5}$). Our modified version of the FNO is capable of solving multi-variable Partial Differential Equations (PDE), and can capture the dependence among the different variables in a single model. FNOs can also predict plasma evolution on real-world experimental data observed by the cameras positioned within the MAST Tokamak, i.e., cameras looking across the central solenoid and the divertor in the Tokamak. We show that FNOs are able to accurately forecast the evolution of plasma and have the potential to be deployed for real-time monitoring. We also illustrate their capability in forecasting the plasma shape, the locations of interactions of the plasma with the central solenoid and the divertor for the full duration of the plasma shot within MAST. The FNO offers a viable alternative for surrogate modelling as it is quick to train and infer, and requires fewer data points, while being able to do zero-shot super-resolution and getting high-fidelity solutions.  ( 3 min )
    The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models. (arXiv:2311.05928v1 [cs.CL])
    In this study, we present an investigation into the anisotropy dynamics and intrinsic dimension of embeddings in transformer architectures, focusing on the dichotomy between encoders and decoders. Our findings reveal that the anisotropy profile in transformer decoders exhibits a distinct bell-shaped curve, with the highest anisotropy concentrations in the middle layers. This pattern diverges from the more uniformly distributed anisotropy observed in encoders. In addition, we found that the intrinsic dimension of embeddings increases in the initial phases of training, indicating an expansion into higher-dimensional space. Which is then followed by a compression phase towards the end of training with dimensionality decrease, suggesting a refinement into more compact representations. Our results provide fresh insights to the understanding of encoders and decoders embedding properties.  ( 2 min )
    FourierGNN: Rethinking Multivariate Time Series Forecasting from a Pure Graph Perspective. (arXiv:2311.06190v1 [cs.LG])
    Multivariate time series (MTS) forecasting has shown great importance in numerous industries. Current state-of-the-art graph neural network (GNN)-based forecasting methods usually require both graph networks (e.g., GCN) and temporal networks (e.g., LSTM) to capture inter-series (spatial) dynamics and intra-series (temporal) dependencies, respectively. However, the uncertain compatibility of the two networks puts an extra burden on handcrafted model designs. Moreover, the separate spatial and temporal modeling naturally violates the unified spatiotemporal inter-dependencies in real world, which largely hinders the forecasting performance. To overcome these problems, we explore an interesting direction of directly applying graph networks and rethink MTS forecasting from a pure graph perspective. We first define a novel data structure, hypervariate graph, which regards each series value (regardless of variates or timestamps) as a graph node, and represents sliding windows as space-time fully-connected graphs. This perspective considers spatiotemporal dynamics unitedly and reformulates classic MTS forecasting into the predictions on hypervariate graphs. Then, we propose a novel architecture Fourier Graph Neural Network (FourierGNN) by stacking our proposed Fourier Graph Operator (FGO) to perform matrix multiplications in Fourier space. FourierGNN accommodates adequate expressiveness and achieves much lower complexity, which can effectively and efficiently accomplish the forecasting. Besides, our theoretical analysis reveals FGO's equivalence to graph convolutions in the time domain, which further verifies the validity of FourierGNN. Extensive experiments on seven datasets have demonstrated our superior performance with higher efficiency and fewer parameters compared with state-of-the-art methods.  ( 3 min )
    ADaPT: As-Needed Decomposition and Planning with Language Models. (arXiv:2311.05772v1 [cs.AI])
    Large Language Models (LLMs) are increasingly being used for interactive decision-making tasks requiring planning and adapting to the environment. Recent works employ LLMs-as-agents in broadly two ways: iteratively determining the next action (iterative executors) or generating plans and executing sub-tasks using LLMs (plan-and-execute). However, these methods struggle with task complexity, as the inability to execute any sub-task may lead to task failure. To address these shortcomings, we introduce As-Needed Decomposition and Planning for complex Tasks (ADaPT), an approach that explicitly plans and decomposes complex sub-tasks as-needed, i.e., when the LLM is unable to execute them. ADaPT recursively decomposes sub-tasks to adapt to both task complexity and LLM capability. Our results demonstrate that ADaPT substantially outperforms established strong baselines, achieving success rates up to 28.3% higher in ALFWorld, 27% in WebShop, and 33% in TextCraft -- a novel compositional dataset that we introduce. Through extensive analysis, we illustrate the importance of multilevel decomposition and establish that ADaPT dynamically adjusts to the capabilities of the executor LLM as well as to task complexity.  ( 2 min )
    EVORA: Deep Evidential Traversability Learning for Risk-Aware Off-Road Autonomy. (arXiv:2311.06234v1 [cs.RO])
    Traversing terrain with good traction is crucial for achieving fast off-road navigation. Instead of manually designing costs based on terrain features, existing methods learn terrain properties directly from data via self-supervision, but challenges remain to properly quantify and mitigate risks due to uncertainties in learned models. This work efficiently quantifies both aleatoric and epistemic uncertainties by learning discrete traction distributions and probability densities of the traction predictor's latent features. Leveraging evidential deep learning, we parameterize Dirichlet distributions with the network outputs and propose a novel uncertainty-aware squared Earth Mover's distance loss with a closed-form expression that improves learning accuracy and navigation performance. The proposed risk-aware planner simulates state trajectories with the worst-case expected traction to handle aleatoric uncertainty, and penalizes trajectories moving through terrain with high epistemic uncertainty. Our approach is extensively validated in simulation and on wheeled and quadruped robots, showing improved navigation performance compared to methods that assume no slip, assume the expected traction, or optimize for the worst-case expected cost.  ( 2 min )
    Time Scale Network: A Shallow Neural Network For Time Series Data. (arXiv:2311.06170v1 [cs.LG])
    Time series data is often composed of information at multiple time scales, particularly in biomedical data. While numerous deep learning strategies exist to capture this information, many make networks larger, require more data, are more demanding to compute, and are difficult to interpret. This limits their usefulness in real-world applications facing even modest computational or data constraints and can further complicate their translation into practice. We present a minimal, computationally efficient Time Scale Network combining the translation and dilation sequence used in discrete wavelet transforms with traditional convolutional neural networks and back-propagation. The network simultaneously learns features at many time scales for sequence classification with significantly reduced parameters and operations. We demonstrate advantages in Atrial Dysfunction detection including: superior accuracy-per-parameter and accuracy-per-operation, fast training and inference speeds, and visualization and interpretation of learned patterns in atrial dysfunction detection on ECG signals. We also demonstrate impressive performance in seizure prediction using EEG signals. Our network isolated a few time scales that could be strategically selected to achieve 90.9% accuracy using only 1,133 active parameters and consistently converged on pulsatile waveform shapes. This method does not rest on any constraints or assumptions regarding signal content and could be leveraged in any area of time series analysis dealing with signals containing features at many time scales.  ( 2 min )
    An Automated Pipeline for Tumour-Infiltrating Lymphocyte Scoring in Breast Cancer. (arXiv:2311.06185v1 [eess.IV])
    Tumour-infiltrating lymphocytes (TILs) are considered as a valuable prognostic markers in both triple-negative and human epidermal growth factor receptor 2 (HER2) breast cancer. In this study, we introduce an innovative deep learning pipeline based on the Efficient-UNet architecture to compute a TILs score for breast cancer whole slide images. Our pipeline first segments tumour-stroma regions and generates a tumour bulk mask. Subsequently, it detects TILs within the tumour-associated stroma, generating a TILs score by closely mirroring the pathologist's workflow. Our method exhibits state-of-the-art performance in segmenting tumour/stroma areas and TILs detection, as demonstrated by internal cross-validation on the TiGER Challenge training dataset and evaluation on the final leaderboards. Additionally, our TILs score proves competitive in predicting survival outcomes within the same challenge, underscoring the clinical relevance and potential of our automated TILs scoring system as a breast cancer prognostic tool.  ( 2 min )
    MALCOM-PSGD: Inexact Proximal Stochastic Gradient Descent for Communication-Efficient Decentralized Machine Learning. (arXiv:2311.05760v1 [cs.LG])
    Recent research indicates that frequent model communication stands as a major bottleneck to the efficiency of decentralized machine learning (ML), particularly for large-scale and over-parameterized neural networks (NNs). In this paper, we introduce MALCOM-PSGD, a new decentralized ML algorithm that strategically integrates gradient compression techniques with model sparsification. MALCOM-PSGD leverages proximal stochastic gradient descent to handle the non-smoothness resulting from the $\ell_1$ regularization in model sparsification. Furthermore, we adapt vector source coding and dithering-based quantization for compressed gradient communication of sparsified models. Our analysis shows that decentralized proximal stochastic gradient descent with compressed communication has a convergence rate of $\mathcal{O}\left(\ln(t)/\sqrt{t}\right)$ assuming a diminishing learning rate and where $t$ denotes the number of iterations. Numerical results verify our theoretical findings and demonstrate that our method reduces communication costs by approximately $75\%$ when compared to the state-of-the-art method.  ( 2 min )
    Differentiable VQ-VAE's for Robust White Matter Streamline Encodings. (arXiv:2311.06212v1 [stat.ML])
    Given the complex geometry of white matter streamlines, Autoencoders have been proposed as a dimension-reduction tool to simplify the analysis streamlines in a low-dimensional latent spaces. However, despite these recent successes, the majority of encoder architectures only perform dimension reduction on single streamlines as opposed to a full bundle of streamlines. This is a severe limitation of the encoder architecture that completely disregards the global geometric structure of streamlines at the expense of individual fibers. Moreover, the latent space may not be well structured which leads to doubt into their interpretability. In this paper we propose a novel Differentiable Vector Quantized Variational Autoencoder, which are engineered to ingest entire bundles of streamlines as single data-point and provides reliable trustworthy encodings that can then be later used to analyze streamlines in the latent space. Comparisons with several state of the art Autoencoders demonstrate superior performance in both encoding and synthesis.  ( 2 min )
    Learning material synthesis-structure-property relationship by data fusion: Bayesian Co-regionalization N-Dimensional Piecewise Function Learning. (arXiv:2311.06228v1 [cs.LG])
    Advanced materials are needed to further next-generation technologies such as quantum computing, carbon capture, and low-cost medical imaging. However, advanced materials discovery is confounded by two fundamental challenges: the challenge of a high-dimensional, complex materials search space and the challenge of combining knowledge, i.e., data fusion across instruments and labs. To overcome the first challenge, researchers employ knowledge of the underlying material synthesis-structure-property relationship, as a material's structure is often predictive of its functional property and vice versa. For example, optimal materials often occur along composition-phase boundaries or within specific phase regions. Additionally, knowledge of the synthesis-structure-property relationship is fundamental to understanding underlying physical mechanisms. However, quantifying the synthesis-structure-property relationship requires overcoming the second challenge. Researchers must merge knowledge gathered across instruments, measurement modalities, and even laboratories. We present the Synthesis-structure-property relAtionship coreGionalized lEarner (SAGE) algorithm. A fully Bayesian algorithm that uses multimodal coregionalization to merge knowledge across data sources to learn synthesis-structure-property relationships.  ( 2 min )
    Improvements on Uncertainty Quantification for Node Classification via Distance-Based Regularization. (arXiv:2311.05795v1 [cs.LG])
    Deep neural networks have achieved significant success in the last decades, but they are not well-calibrated and often produce unreliable predictions. A large number of literature relies on uncertainty quantification to evaluate the reliability of a learning model, which is particularly important for applications of out-of-distribution (OOD) detection and misclassification detection. We are interested in uncertainty quantification for interdependent node-level classification. We start our analysis based on graph posterior networks (GPNs) that optimize the uncertainty cross-entropy (UCE)-based loss function. We describe the theoretical limitations of the widely-used UCE loss. To alleviate the identified drawbacks, we propose a distance-based regularization that encourages clustered OOD nodes to remain clustered in the latent space. We conduct extensive comparison experiments on eight standard datasets and demonstrate that the proposed regularization outperforms the state-of-the-art in both OOD detection and misclassification detection.  ( 2 min )
    Verilog-to-PyG -- A Framework for Graph Learning and Augmentation on RTL Designs. (arXiv:2311.05722v1 [cs.LG])
    The complexity of modern hardware designs necessitates advanced methodologies for optimizing and analyzing modern digital systems. In recent times, machine learning (ML) methodologies have emerged as potent instruments for assessing design quality-of-results at the Register-Transfer Level (RTL) or Boolean level, aiming to expedite design exploration of advanced RTL configurations. In this presentation, we introduce an innovative open-source framework that translates RTL designs into graph representation foundations, which can be seamlessly integrated with the PyTorch Geometric graph learning platform. Furthermore, the Verilog-to-PyG (V2PYG) framework is compatible with the open-source Electronic Design Automation (EDA) toolchain OpenROAD, facilitating the collection of labeled datasets in an utterly open-source manner. Additionally, we will present novel RTL data augmentation methods (incorporated in our framework) that enable functional equivalent design augmentation for the construction of an extensive graph-based RTL design database. Lastly, we will showcase several using cases of V2PYG with detailed scripting examples. V2PYG can be found at \url{https://yu-maryland.github.io/Verilog-to-PyG/}.  ( 2 min )
    LogShield: A Transformer-based APT Detection System Leveraging Self-Attention. (arXiv:2311.05733v1 [cs.CR])
    Cyber attacks are often identified using system and network logs. There have been significant prior works that utilize provenance graphs and ML techniques to detect attacks, specifically advanced persistent threats, which are very difficult to detect. Lately, there have been studies where transformer-based language models are being used to detect various types of attacks from system logs. However, no such attempts have been made in the case of APTs. In addition, existing state-of-the-art techniques that use system provenance graphs, lack a data processing framework generalized across datasets for optimal performance. For mitigating this limitation as well as exploring the effectiveness of transformer-based language models, this paper proposes LogShield, a framework designed to detect APT attack patterns leveraging the power of self-attention in transformers. We incorporate customized embedding layers to effectively capture the context of event sequences derived from provenance graphs. While acknowledging the computational overhead associated with training transformer networks, our framework surpasses existing LSTM and Language models regarding APT detection. We integrated the model parameters and training procedure from the RoBERTa model and conducted extensive experiments on well-known APT datasets (DARPA OpTC and DARPA TC E3). Our framework achieved superior F1 scores of 98% and 95% on the two datasets respectively, surpassing the F1 scores of 96% and 94% obtained by LSTM models. Our findings suggest that LogShield's performance benefits from larger datasets and demonstrates its potential for generalization across diverse domains. These findings contribute to the advancement of APT attack detection methods and underscore the significance of transformer-based architectures in addressing security challenges in computer systems.  ( 3 min )
    Detecting Suspicious Commenter Mob Behaviors on YouTube Using Graph2Vec. (arXiv:2311.05791v1 [cs.SI])
    YouTube, a widely popular online platform, has transformed the dynamics of con-tent consumption and interaction for users worldwide. With its extensive range of content crea-tors and viewers, YouTube serves as a hub for video sharing, entertainment, and information dissemination. However, the exponential growth of users and their active engagement on the platform has raised concerns regarding suspicious commenter behaviors, particularly in the com-ment section. This paper presents a social network analysis-based methodology for detecting suspicious commenter mob-like behaviors among YouTube channels and the similarities therein. The method aims to characterize channels based on the level of such behavior and identify com-mon patterns across them. To evaluate the effectiveness of the proposed model, we conducted an analysis of 20 YouTube channels, consisting of 7,782 videos, 294,199 commenters, and 596,982 comments. These channels were specifically selected for propagating false views about the U.S. Military. The analysis revealed significant similarities among the channels, shedding light on the prevalence of suspicious commenter behavior. By understanding these similarities, we contribute to a better understanding of the dynamics of suspicious behavior on YouTube channels, which can inform strategies for addressing and mitigating such behavior.  ( 2 min )
    Neural Network Methods for Radiation Detectors and Imaging. (arXiv:2311.05726v1 [physics.ins-det])
    Recent advances in image data processing through machine learning and especially deep neural networks (DNNs) allow for new optimization and performance-enhancement schemes for radiation detectors and imaging hardware through data-endowed artificial intelligence. We give an overview of data generation at photon sources, deep learning-based methods for image processing tasks, and hardware solutions for deep learning acceleration. Most existing deep learning approaches are trained offline, typically using large amounts of computational resources. However, once trained, DNNs can achieve fast inference speeds and can be deployed to edge devices. A new trend is edge computing with less energy consumption (hundreds of watts or less) and real-time analysis potential. While popularly used for edge computing, electronic-based hardware accelerators ranging from general purpose processors such as central processing units (CPUs) to application-specific integrated circuits (ASICs) are constantly reaching performance limits in latency, energy consumption, and other physical constraints. These limits give rise to next-generation analog neuromorhpic hardware platforms, such as optical neural networks (ONNs), for high parallel, low latency, and low energy computing to boost deep learning acceleration.  ( 2 min )
    A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning. (arXiv:2311.05877v1 [cs.LG])
    Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent overfitting in subsequent downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance. Motivated by the increasing popularity of tabular deep learning, we construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers, using real datasets and multiple methods for generating extraneous features. We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems such as selecting from corrupted or second-order features.
    Real-time Control of Electric Autonomous Mobility-on-Demand Systems via Graph Reinforcement Learning. (arXiv:2311.05780v1 [eess.SY])
    Operators of Electric Autonomous Mobility-on-Demand (E-AMoD) fleets need to make several real-time decisions such as matching available cars to ride requests, rebalancing idle cars to areas of high demand, and charging vehicles to ensure sufficient range. While this problem can be posed as a linear program that optimizes flows over a space-charge-time graph, the size of the resulting optimization problem does not allow for real-time implementation in realistic settings. In this work, we present the E-AMoD control problem through the lens of reinforcement learning and propose a graph network-based framework to achieve drastically improved scalability and superior performance over heuristics. Specifically, we adopt a bi-level formulation where we (1) leverage a graph network-based RL agent to specify a desired next state in the space-charge graph, and (2) solve more tractable linear programs to best achieve the desired state while ensuring feasibility. Experiments using real-world data from San Francisco and New York City show that our approach achieves up to 89% of the profits of the theoretically-optimal solution while achieving more than a 100x speedup in computational time. Furthermore, our approach outperforms the best domain-specific heuristics with comparable runtimes, with an increase in profits by up to 3x. Finally, we highlight promising zero-shot transfer capabilities of our learned policy on tasks such as inter-city generalization and service area expansion, thus showing the utility, scalability, and flexibility of our framework.  ( 3 min )
    Enhancing Instance-Level Image Classification with Set-Level Labels. (arXiv:2311.05659v1 [cs.LG])
    Instance-level image classification tasks have traditionally relied on single-instance labels to train models, e.g., few-shot learning and transfer learning. However, set-level coarse-grained labels that capture relationships among instances can provide richer information in real-world scenarios. In this paper, we present a novel approach to enhance instance-level image classification by leveraging set-level labels. We provide a theoretical analysis of the proposed method, including recognition conditions for fast excess risk rate, shedding light on the theoretical foundations of our approach. We conducted experiments on two distinct categories of datasets: natural image datasets and histopathology image datasets. Our experimental results demonstrate the effectiveness of our approach, showcasing improved classification performance compared to traditional single-instance label-based methods. Notably, our algorithm achieves 13% improvement in classification accuracy compared to the strongest baseline on the histopathology image classification benchmarks. Importantly, our experimental findings align with the theoretical analysis, reinforcing the robustness and reliability of our proposed method. This work bridges the gap between instance-level and set-level image classification, offering a promising avenue for advancing the capabilities of image classification models with set-level coarse-grained labels.  ( 2 min )
    Explainable artificial intelligence for Healthcare applications using Random Forest Classifier with LIME and SHAP. (arXiv:2311.05665v1 [cs.LG])
    With the advances in computationally efficient artificial Intelligence (AI) techniques and their numerous applications in our everyday life, there is a pressing need to understand the computational details hidden in black box AI techniques such as most popular machine learning and deep learning techniques; through more detailed explanations. The origin of explainable AI (xAI) is coined from these challenges and recently gained more attention by the researchers by adding explainability comprehensively in traditional AI systems. This leads to develop an appropriate framework for successful applications of xAI in real life scenarios with respect to innovations, risk mitigation, ethical issues and logical values to the users. In this book chapter, an in-depth analysis of several xAI frameworks and methods including LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) are provided. Random Forest Classifier as black box AI is used on a publicly available Diabetes symptoms dataset with LIME and SHAP for better interpretations. The results obtained are interesting in terms of transparency, valid and trustworthiness in diabetes disease prediction.  ( 2 min )
    Efficiently Adapting Pretrained Language Models To New Languages. (arXiv:2311.05741v1 [cs.CL])
    Recent large language models (LLM) exhibit sub-optimal performance on low-resource languages, as the training data of these models is usually dominated by English and other high-resource languages. Furthermore, it is challenging to train models for low-resource languages, especially from scratch, due to a lack of high quality training data. Adapting pretrained LLMs reduces the need for data in the new language while also providing cross lingual transfer capabilities. However, naively adapting to new languages leads to catastrophic forgetting and poor tokenizer efficiency. In this work, we study how to efficiently adapt any existing pretrained LLM to a new language without running into these issues. In particular, we improve the encoding efficiency of the tokenizer by adding new tokens from the target language and study the data mixing recipe to mitigate forgetting. Our experiments on adapting an English LLM to Hungarian and Thai show that our recipe can reach better performance than open source models on the target language, with minimal regressions on English.  ( 2 min )
    Generative Explanations for Graph Neural Network: Methods and Evaluations. (arXiv:2311.05764v1 [cs.LG])
    Graph Neural Networks (GNNs) achieve state-of-the-art performance in various graph-related tasks. However, the black-box nature often limits their interpretability and trustworthiness. Numerous explainability methods have been proposed to uncover the decision-making logic of GNNs, by generating underlying explanatory substructures. In this paper, we conduct a comprehensive review of the existing explanation methods for GNNs from the perspective of graph generation. Specifically, we propose a unified optimization objective for generative explanation methods, comprising two sub-objectives: Attribution and Information constraints. We further demonstrate their specific manifestations in various generative model architectures and different explanation scenarios. With the unified objective of the explanation problem, we reveal the shared characteristics and distinctions among current methods, laying the foundation for future methodological advancements. Empirical results demonstrate the advantages and limitations of different explainability approaches in terms of explanation performance, efficiency, and generalizability.  ( 2 min )
    Dirichlet Energy Enhancement of Graph Neural Networks by Framelet Augmentation. (arXiv:2311.05767v1 [cs.LG])
    Graph convolutions have been a pivotal element in learning graph representations. However, recursively aggregating neighboring information with graph convolutions leads to indistinguishable node features in deep layers, which is known as the over-smoothing issue. The performance of graph neural networks decays fast as the number of stacked layers increases, and the Dirichlet energy associated with the graph decreases to zero as well. In this work, we introduce a framelet system into the analysis of Dirichlet energy and take a multi-scale perspective to leverage the Dirichlet energy and alleviate the over-smoothing issue. Specifically, we develop a Framelet Augmentation strategy by adjusting the update rules with positive and negative increments for low-pass and high-passes respectively. Based on that, we design the Energy Enhanced Convolution (EEConv), which is an effective and practical operation that is proved to strictly enhance Dirichlet energy. From a message-passing perspective, EEConv inherits multi-hop aggregation property from the framelet transform and takes into account all hops in the multi-scale representation, which benefits the node classification tasks over heterophilous graphs. Experiments show that deep GNNs with EEConv achieve state-of-the-art performance over various node classification datasets, especially for heterophilous graphs, while also lifting the Dirichlet energy as the network goes deeper.  ( 2 min )
    On Mergable Coresets for Polytope Distance. (arXiv:2311.05651v1 [cs.CG])
    We show that a constant-size constant-error coreset for polytope distance is simple to maintain under merges of coresets. However, increasing the size cannot improve the error bound significantly beyond that constant.  ( 2 min )
    Optimal simulation-based Bayesian decisions. (arXiv:2311.05742v1 [stat.ML])
    We present a framework for the efficient computation of optimal Bayesian decisions under intractable likelihoods, by learning a surrogate model for the expected utility (or its distribution) as a function of the action and data spaces. We leverage recent advances in simulation-based inference and Bayesian optimization to develop active learning schemes to choose where in parameter and action spaces to simulate. This allows us to learn the optimal action in as few simulations as possible. The resulting framework is extremely simulation efficient, typically requiring fewer model calls than the associated posterior inference task alone, and a factor of $100-1000$ more efficient than Monte-Carlo based methods. Our framework opens up new capabilities for performing Bayesian decision making, particularly in the previously challenging regime where likelihoods are intractable, and simulations expensive.  ( 2 min )
    Generating Pragmatic Examples to Train Neural Program Synthesizers. (arXiv:2311.05740v1 [cs.LG])
    Programming-by-example is the task of synthesizing a program that is consistent with a set of user-provided input-output examples. As examples are often an under-specification of one's intent, a good synthesizer must choose the intended program from the many that are consistent with the given set of examples. Prior work frames program synthesis as a cooperative game between a listener (that synthesizes programs) and a speaker (a user choosing examples), and shows that models of computational pragmatic inference are effective in choosing the user intended programs. However, these models require counterfactual reasoning over a large set of programs and examples, which is infeasible in realistic program spaces. In this paper, we propose a novel way to amortize this search with neural networks. We sample pairs of programs and examples via self-play between listener and speaker models, and use pragmatic inference to choose informative training examples from this sample.We then use the informative dataset to train models to improve the synthesizer's ability to disambiguate user-provided examples without human supervision. We validate our method on the challenging task of synthesizing regular expressions from example strings, and find that our method (1) outperforms models trained without choosing pragmatic examples by 23% (a 51% relative increase) (2) matches the performance of supervised learning on a dataset of pragmatic examples provided by humans, despite using no human data in training.  ( 2 min )
    Prompt Engineering a Prompt Engineer. (arXiv:2311.05661v1 [cs.CL])
    Prompt engineering is a challenging yet crucial task for optimizing the performance of large language models (LLMs). It requires complex reasoning to examine the model's errors, hypothesize what is missing or misleading in the current prompt, and communicate the task with clarity. While recent works indicate that LLMs can be meta-prompted to perform automatic prompt engineering, their potentials may not be fully untapped due to the lack of sufficient guidance to elicit complex reasoning capabilities in LLMs in the meta-prompt. In this work, we investigate the problem of "prompt engineering a prompt engineer" -- constructing a meta-prompt that more effectively guides LLMs to perform automatic prompt engineering. We introduce and analyze key components, such as a step-by-step reasoning template and context specification, which lead to improved performance. In addition, inspired by common optimization concepts such as batch size, step size and momentum, we introduce their verbalized counterparts to the meta-prompt and investigate their effects. Our final method, named PE2, finds a prompt that outperforms "let's think step by step" by 6.3% on the MultiArith dataset and 3.1% on the GSM8K dataset. To demonstrate its versatility, we apply PE2 to the Instruction Induction benchmark, a suite of counterfactual tasks, and a lengthy, real-world industrial prompt. In these settings, PE2 achieves strong performance and outperforms prior automatic prompt engineering baselines. Further, we show that PE2 makes meaningful and targeted prompt edits, amends erroneous or incomplete prompts, and presents non-trivial counterfactual reasoning abilities.  ( 2 min )
    A theory for the sparsity emerged in the Forward Forward algorithm. (arXiv:2311.05667v1 [cs.LG])
    This report explores the theory that explains the high sparsity phenomenon \citep{tosato2023emergent} observed in the forward-forward algorithm \citep{hinton2022forward}. The two theorems proposed predict the sparsity changes of a single data point's activation in two cases: Theorem \ref{theorem:1}: Decrease the goodness of the whole batch. Theorem \ref{theorem:2}: Apply the complete forward forward algorithm to decrease the goodness for negative data and increase the goodness for positive data. The theory aligns well with the experiments tested on the MNIST dataset.  ( 2 min )
    Deep Learning Architecture for Network-Efficiency at the Edge. (arXiv:2311.05739v1 [cs.NI])
    The growing number of AI-driven applications in the mobile devices has led to solutions that integrate deep learning models with the available edge-cloud resources; due to multiple benefits such as reduction in on-device energy consumption, improved latency, improved network usage, and certain privacy improvements, split learning, where deep learning models are split away from the mobile device and computed in a distributed manner, has become an extensively explored topic. Combined with compression-aware methods where learning adapts to compression of communicated data, the benefits of this approach have further improved and could serve as an alternative to established approaches like federated learning methods. In this work, we develop an adaptive compression-aware split learning method ('deprune') to improve and train deep learning models so that they are much more network-efficient (use less network resources and are faster), which would make them ideal to deploy in weaker devices with the help of edge-cloud resources. This method is also extended ('prune') to very quickly train deep learning models, through a transfer learning approach, that trades off little accuracy for much more network-efficient inference abilities. We show that the 'deprune' method can reduce network usage by 4x when compared with a split-learning approach (that does not use our method) without loss of accuracy, while also improving accuracy over compression-aware split-learning by 4 percent. Lastly, we show that the 'prune' method can reduce the training time for certain models by up to 6x without affecting the accuracy when compared against a compression-aware split-learning approach.  ( 3 min )
    EControl: Fast Distributed Optimization with Compression and Error Control. (arXiv:2311.05645v1 [math.OC])
    Modern distributed training relies heavily on communication compression to reduce the communication overhead. In this work, we study algorithms employing a popular class of contractive compressors in order to reduce communication overhead. However, the naive implementation often leads to unstable convergence or even exponential divergence due to the compression bias. Error Compensation (EC) is an extremely popular mechanism to mitigate the aforementioned issues during the training of models enhanced by contractive compression operators. Compared to the effectiveness of EC in the data homogeneous regime, the understanding of the practicality and theoretical foundations of EC in the data heterogeneous regime is limited. Existing convergence analyses typically rely on strong assumptions such as bounded gradients, bounded data heterogeneity, or large batch accesses, which are often infeasible in modern machine learning applications. We resolve the majority of current issues by proposing EControl, a novel mechanism that can regulate error compensation by controlling the strength of the feedback signal. We prove fast convergence for EControl in standard strongly convex, general convex, and nonconvex settings without any additional assumptions on the problem or data heterogeneity. We conduct extensive numerical evaluations to illustrate the efficacy of our method and support our theoretical findings.  ( 2 min )
    Long-Horizon Dialogue Understanding for Role Identification in the Game of Avalon with Large Language Models. (arXiv:2311.05720v1 [cs.CL])
    Deception and persuasion play a critical role in long-horizon dialogues between multiple parties, especially when the interests, goals, and motivations of the participants are not aligned. Such complex tasks pose challenges for current Large Language Models (LLM) as deception and persuasion can easily mislead them, especially in long-horizon multi-party dialogues. To this end, we explore the game of Avalon: The Resistance, a social deduction game in which players must determine each other's hidden identities to complete their team's objective. We introduce an online testbed and a dataset containing 20 carefully collected and labeled games among human players that exhibit long-horizon deception in a cooperative-competitive setting. We discuss the capabilities of LLMs to utilize deceptive long-horizon conversations between six human players to determine each player's goal and motivation. Particularly, we discuss the multimodal integration of the chat between the players and the game's state that grounds the conversation, providing further insights into the true player identities. We find that even current state-of-the-art LLMs do not reach human performance, making our dataset a compelling benchmark to investigate the decision-making and language-processing capabilities of LLMs. Our dataset and online testbed can be found at our project website: https://sstepput.github.io/Avalon-NLU/  ( 3 min )
    Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs. (arXiv:2311.05657v1 [cs.AI])
    We introduce Lumos, a novel framework for training language agents that employs a unified data format and a modular architecture based on open-source large language models (LLMs). Lumos consists of three distinct modules: planning, grounding, and execution. The planning module breaks down a task into a series of high-level, tool-agnostic subgoals, which are then made specific by the grounding module through a set of low-level actions. These actions are subsequently executed by the execution module, utilizing a range of off-the-shelf tools and APIs. In order to train these modules effectively, high-quality annotations of subgoals and actions were collected and are made available for fine-tuning open-source LLMs for various tasks such as complex question answering, web tasks, and math problems. Leveraging this unified data and modular design, Lumos not only achieves comparable or superior performance to current, state-of-the-art agents, but also exhibits several key advantages: (1) Lumos surpasses GPT-4/3.5-based agents in complex question answering and web tasks, while equalling the performance of significantly larger LLM agents on math tasks; (2) Lumos outperforms open-source agents created through conventional training methods and those using chain-of-thoughts training; and (3) Lumos is capable of effectively generalizing to unseen interactive tasks, outperforming larger LLM-based agents and even exceeding performance of specialized agents.  ( 2 min )
    Towards Instance-Optimality in Online PAC Reinforcement Learning. (arXiv:2311.05638v1 [stat.ML])
    Several recent works have proposed instance-dependent upper bounds on the number of episodes needed to identify, with probability $1-\delta$, an $\varepsilon$-optimal policy in finite-horizon tabular Markov Decision Processes (MDPs). These upper bounds feature various complexity measures for the MDP, which are defined based on different notions of sub-optimality gaps. However, as of now, no lower bound has been established to assess the optimality of any of these complexity measures, except for the special case of MDPs with deterministic transitions. In this paper, we propose the first instance-dependent lower bound on the sample complexity required for the PAC identification of a near-optimal policy in any tabular episodic MDP. Additionally, we demonstrate that the sample complexity of the PEDEL algorithm of \cite{Wagenmaker22linearMDP} closely approaches this lower bound. Considering the intractability of PEDEL, we formulate an open question regarding the possibility of achieving our lower bound using a computationally-efficient algorithm.  ( 2 min )
    Learning to Configure Separators in Branch-and-Cut. (arXiv:2311.05650v1 [math.OC])
    Cutting planes are crucial in solving mixed integer linear programs (MILP) as they facilitate bound improvements on the optimal solution. Modern MILP solvers rely on a variety of separators to generate a diverse set of cutting planes by invoking the separators frequently during the solving process. This work identifies that MILP solvers can be drastically accelerated by appropriately selecting separators to activate. As the combinatorial separator selection space imposes challenges for machine learning, we learn to separate by proposing a novel data-driven strategy to restrict the selection space and a learning-guided algorithm on the restricted space. Our method predicts instance-aware separator configurations which can dynamically adapt during the solve, effectively accelerating the open source MILP solver SCIP by improving the relative solve time up to 72% and 37% on synthetic and real-world MILP benchmarks. Our work complements recent work on learning to select cutting planes and highlights the importance of separator management.  ( 2 min )
    Mobile Internet Quality Estimation using Self-Tuning Kernel Regression. (arXiv:2311.05641v1 [stat.AP])
    Modeling and estimation for spatial data are ubiquitous in real life, frequently appearing in weather forecasting, pollution detection, and agriculture. Spatial data analysis often involves processing datasets of enormous scale. In this work, we focus on large-scale internet-quality open datasets from Ookla. We look into estimating mobile (cellular) internet quality at the scale of a state in the United States. In particular, we aim to conduct estimation based on highly {\it imbalanced} data: Most of the samples are concentrated in limited areas, while very few are available in the rest, posing significant challenges to modeling efforts. We propose a new adaptive kernel regression approach that employs self-tuning kernels to alleviate the adverse effects of data imbalance in this problem. Through comparative experimentation on two distinct mobile network measurement datasets, we demonstrate that the proposed self-tuning kernel regression method produces more accurate predictions, with the potential to be applied in other applications.  ( 2 min )
  • Open

    An Improved Relaxation for Oracle-Efficient Adversarial Contextual Bandits. (arXiv:2310.19025v2 [cs.LG] UPDATED)
    We present an oracle-efficient relaxation for the adversarial contextual bandits problem, where the contexts are sequentially drawn i.i.d from a known distribution and the cost sequence is chosen by an online adversary. Our algorithm has a regret bound of $O(T^{\frac{2}{3}}(K\log(|\Pi|))^{\frac{1}{3}})$ and makes at most $O(K)$ calls per round to an offline optimization oracle, where $K$ denotes the number of actions, $T$ denotes the number of rounds and $\Pi$ denotes the set of policies. This is the first result to improve the prior best bound of $O((TK)^{\frac{2}{3}}(\log(|\Pi|))^{\frac{1}{3}})$ as obtained by Syrgkanis et al. at NeurIPS 2016, and the first to match the original bound of Langford and Zhang at NeurIPS 2007 which was obtained for the stochastic case.  ( 2 min )
    Understanding Reconstruction Attacks with the Neural Tangent Kernel and Dataset Distillation. (arXiv:2302.01428v2 [cs.LG] UPDATED)
    Modern deep learning requires large volumes of data, which could contain sensitive or private information that cannot be leaked. Recent work has shown for homogeneous neural networks a large portion of this training data could be reconstructed with only access to the trained network parameters. While the attack was shown to work empirically, there exists little formal understanding of its effective regime which datapoints are susceptible to reconstruction. In this work, we first build a stronger version of the dataset reconstruction attack and show how it can provably recover the \emph{entire training set} in the infinite width regime. We then empirically study the characteristics of this attack on two-layer networks and reveal that its success heavily depends on deviations from the frozen infinite-width Neural Tangent Kernel limit. Next, we study the nature of easily-reconstructed images. We show that both theoretically and empirically, reconstructed images tend to "outliers" in the dataset, and that these reconstruction attacks can be used for \textit{dataset distillation}, that is, we can retrain on reconstructed images and obtain high predictive accuracy.  ( 2 min )
    Optimal Regret Is Achievable with Bounded Approximate Inference Error: An Enhanced Bayesian Upper Confidence Bound Framework. (arXiv:2201.12955v4 [cs.LG] UPDATED)
    Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. However, there is a large discrepancy between the superior practical performance of these approaches and their theoretical justification. Previous research only indicates a negative theoretical result: Thompson sampling could have a worst-case linear regret $\Omega(T)$ with a constant threshold on the inference error measured by one $\alpha$-divergence. To bridge this gap, we propose an Enhanced Bayesian Upper Confidence Bound (EBUCB) framework that can efficiently accommodate bandit problems in the presence of approximate inference. Our theoretical analysis demonstrates that for Bernoulli multi-armed bandits, EBUCB can achieve the optimal regret order $O(\log T)$ if the inference error measured by two different $\alpha$-divergences is less than a constant, regardless of how large this constant is. To our best knowledge, our study provides the first theoretical regret bound that is better than $o(T)$ in the setting of constant approximate inference error. Furthermore, in concordance with the negative results in previous studies, we show that only one bounded $\alpha$-divergence is insufficient to guarantee a sub-linear regret.  ( 3 min )
    Efficient Uncertainty Quantification and Reduction for Over-Parameterized Neural Networks. (arXiv:2306.05674v2 [stat.ML] UPDATED)
    Uncertainty quantification (UQ) is important for reliability assessment and enhancement of machine learning models. In deep learning, uncertainties arise not only from data, but also from the training procedure that often injects substantial noises and biases. These hinder the attainment of statistical guarantees and, moreover, impose computational challenges on UQ due to the need for repeated network retraining. Building upon the recent neural tangent kernel theory, we create statistically guaranteed schemes to principally \emph{characterize}, and \emph{remove}, the uncertainty of over-parameterized neural networks with very low computation effort. In particular, our approach, based on what we call a procedural-noise-correcting (PNC) predictor, removes the procedural uncertainty by using only \emph{one} auxiliary network that is trained on a suitably labeled dataset, instead of many retrained networks employed in deep ensembles. Moreover, by combining our PNC predictor with suitable light-computation resampling methods, we build several approaches to construct asymptotically exact-coverage confidence intervals using as low as four trained networks without additional overheads.  ( 2 min )
    An Accelerated Doubly Stochastic Gradient Method with Faster Explicit Model Identification. (arXiv:2208.06058v1 [cs.LG] CROSS LISTED)
    Sparsity regularized loss minimization problems play an important role in various fields including machine learning, data mining, and modern statistics. Proximal gradient descent method and coordinate descent method are the most popular approaches to solving the minimization problem. Although existing methods can achieve implicit model identification, aka support set identification, in a finite number of iterations, these methods still suffer from huge computational costs and memory burdens in high-dimensional scenarios. The reason is that the support set identification in these methods is implicit and thus cannot explicitly identify the low-complexity structure in practice, namely, they cannot discard useless coefficients of the associated features to achieve algorithmic acceleration via dimension reduction. To address this challenge, we propose a novel accelerated doubly stochastic gradient descent (ADSGD) method for sparsity regularized loss minimization problems, which can reduce the number of block iterations by eliminating inactive coefficients during the optimization process and eventually achieve faster explicit model identification and improve the algorithm efficiency. Theoretically, we first prove that ADSGD can achieve a linear convergence rate and lower overall computational complexity. More importantly, we prove that ADSGD can achieve a linear rate of explicit model identification. Numerically, experimental results on benchmark datasets confirm the efficiency of our proposed method.  ( 2 min )
    Adapting Fairness Interventions to Missing Values. (arXiv:2305.19429v2 [cs.LG] UPDATED)
    Missing values in real-world data pose a significant and unique challenge to algorithmic fairness. Different demographic groups may be unequally affected by missing data, and the standard procedure for handling missing values where first data is imputed, then the imputed data is used for classification -- a procedure referred to as "impute-then-classify" -- can exacerbate discrimination. In this paper, we analyze how missing values affect algorithmic fairness. We first prove that training a classifier from imputed data can significantly worsen the achievable values of group fairness and average accuracy. This is because imputing data results in the loss of the missing pattern of the data, which often conveys information about the predictive label. We present scalable and adaptive algorithms for fair classification with missing values. These algorithms can be combined with any preexisting fairness-intervention algorithm to handle all possible missing patterns while preserving information encoded within the missing patterns. Numerical experiments with state-of-the-art fairness interventions demonstrate that our adaptive algorithms consistently achieve higher fairness and accuracy than impute-then-classify across different datasets.  ( 2 min )
    TAKDE: Temporal Adaptive Kernel Density Estimator for Real-Time Dynamic Density Estimation. (arXiv:2203.08317v2 [stat.ML] UPDATED)
    Real-time density estimation is ubiquitous in many applications, including computer vision and signal processing. Kernel density estimation is arguably one of the most commonly used density estimation techniques, and the use of "sliding window" mechanism adapts kernel density estimators to dynamic processes. In this paper, we derive the asymptotic mean integrated squared error (AMISE) upper bound for the "sliding window" kernel density estimator. This upper bound provides a principled guide to devise a novel estimator, which we name the temporal adaptive kernel density estimator (TAKDE). Compared to heuristic approaches for "sliding window" kernel density estimator, TAKDE is theoretically optimal in terms of the worst-case AMISE. We provide numerical experiments using synthetic and real-world datasets, showing that TAKDE outperforms other state-of-the-art dynamic density estimators (including those outside of kernel family). In particular, TAKDE achieves a superior test log-likelihood with a smaller runtime.  ( 2 min )
    Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions. (arXiv:2311.06108v1 [math.ST])
    The consistency of the maximum likelihood estimator for mixtures of elliptically-symmetric distributions for estimating its population version is shown, where the underlying distribution $P$ is nonparametric and does not necessarily belong to the class of mixtures on which the estimator is based. In a situation where $P$ is a mixture of well enough separated but nonparametric distributions it is shown that the components of the population version of the estimator correspond to the well separated components of $P$. This provides some theoretical justification for the use of such estimators for cluster analysis in case that $P$ has well separated subpopulations even if these subpopulations differ from what the mixture model assumes.  ( 2 min )
    Distributionally Robust Skeleton Learning of Discrete Bayesian Networks. (arXiv:2311.06117v1 [cs.LG])
    We consider the problem of learning the exact skeleton of general discrete Bayesian networks from potentially corrupted data. Building on distributionally robust optimization and a regression approach, we propose to optimize the most adverse risk over a family of distributions within bounded Wasserstein distance or KL divergence to the empirical distribution. The worst-case risk accounts for the effect of outliers. The proposed approach applies for general categorical random variables without assuming faithfulness, an ordinal relationship or a specific form of conditional distribution. We present efficient algorithms and show the proposed methods are closely related to the standard regularized regression approach. Under mild assumptions, we derive non-asymptotic guarantees for successful structure learning with logarithmic sample complexities for bounded-degree graphs. Numerical study on synthetic and real datasets validates the effectiveness of our method. Code is available at https://github.com/DanielLeee/drslbn.  ( 2 min )
    PriorCVAE: scalable MCMC parameter inference with Bayesian deep generative modelling. (arXiv:2304.04307v3 [stat.ML] UPDATED)
    Recent advances have shown that GP priors, or their finite realisations, can be encoded using deep generative models such as variational autoencoders (VAEs). These learned generators can serve as drop-in replacements for the original priors during MCMC inference. While this approach enables efficient inference, it loses information about the hyperparameters of the original models, and consequently makes inference over hyperparameters impossible and the learned priors indistinct. To overcome this limitation, we condition the VAE on stochastic process hyperparameters. This allows the joint encoding of hyperparameters with GP realizations and their subsequent estimation during inference. Further, we demonstrate that our proposed method, PriorCVAE, is agnostic to the nature of the models which it approximates, and can be used, for instance, to encode solutions of ODEs. It provides a practical tool for approximate inference and shows potential in real-life spatial and spatiotemporal applications.  ( 2 min )
    Towards Instance-Optimality in Online PAC Reinforcement Learning. (arXiv:2311.05638v1 [stat.ML])
    Several recent works have proposed instance-dependent upper bounds on the number of episodes needed to identify, with probability $1-\delta$, an $\varepsilon$-optimal policy in finite-horizon tabular Markov Decision Processes (MDPs). These upper bounds feature various complexity measures for the MDP, which are defined based on different notions of sub-optimality gaps. However, as of now, no lower bound has been established to assess the optimality of any of these complexity measures, except for the special case of MDPs with deterministic transitions. In this paper, we propose the first instance-dependent lower bound on the sample complexity required for the PAC identification of a near-optimal policy in any tabular episodic MDP. Additionally, we demonstrate that the sample complexity of the PEDEL algorithm of \cite{Wagenmaker22linearMDP} closely approaches this lower bound. Considering the intractability of PEDEL, we formulate an open question regarding the possibility of achieving our lower bound using a computationally-efficient algorithm.  ( 2 min )
    Stabilizing Estimates of Shapley Values with Control Variates. (arXiv:2310.07672v2 [stat.ML] UPDATED)
    Shapley values are among the most popular tools for explaining predictions of blackbox machine learning models. However, their high computational cost motivates the use of sampling approximations, inducing a considerable degree of uncertainty. To stabilize these model explanations, we propose ControlSHAP, an approach based on the Monte Carlo technique of control variates. Our methodology is applicable to any machine learning model and requires virtually no extra computation or modeling effort. On several high-dimensional datasets, we find it can produce dramatic reductions in the Monte Carlo variability of Shapley estimates.  ( 2 min )
    Optimal Cooperative Multiplayer Learning Bandits with Noisy Rewards and No Communication. (arXiv:2311.06210v1 [cs.LG])
    We consider a cooperative multiplayer bandit learning problem where the players are only allowed to agree on a strategy beforehand, but cannot communicate during the learning process. In this problem, each player simultaneously selects an action. Based on the actions selected by all players, the team of players receives a reward. The actions of all the players are commonly observed. However, each player receives a noisy version of the reward which cannot be shared with other players. Since players receive potentially different rewards, there is an asymmetry in the information used to select their actions. In this paper, we provide an algorithm based on upper and lower confidence bounds that the players can use to select their optimal actions despite the asymmetry in the reward information. We show that this algorithm can achieve logarithmic $O(\frac{\log T}{\Delta_{\bm{a}}})$ (gap-dependent) regret as well as $O(\sqrt{T\log T})$ (gap-independent) regret. This is asymptotically optimal in $T$. We also show that it performs empirically better than the current state of the art algorithm for this environment.  ( 2 min )
    Minimum norm interpolation by perceptra: Explicit regularization and implicit bias. (arXiv:2311.06138v1 [stat.ML])
    We investigate how shallow ReLU networks interpolate between known regions. Our analysis shows that empirical risk minimizers converge to a minimum norm interpolant as the number of data points and parameters tends to infinity when a weight decay regularizer is penalized with a coefficient which vanishes at a precise rate as the network width and the number of data points grow. With and without explicit regularization, we numerically study the implicit bias of common optimization algorithms towards known minimum norm interpolants.  ( 2 min )
    Conditional Optimal Transport on Function Spaces. (arXiv:2311.05672v1 [math.OC])
    We present a systematic study of conditional triangular transport maps in function spaces from the perspective of optimal transportation and with a view towards amortized Bayesian inference. More specifically, we develop a theory of constrained optimal transport problems that describe block-triangular Monge maps that characterize conditional measures along with their Kantorovich relaxations. This generalizes the theory of optimal triangular transport to infinite-dimensional Hilbert spaces with general cost functions. We further tailor our results to the case of Bayesian inference problems and obtain regularity estimates on the conditioning maps from the prior to the posterior. Finally, we present numerical experiments that demonstrate the computational applicability of our theoretical results for amortized and likelihood-free Bayesian inverse problems.  ( 2 min )
    Fair Supervised Learning with A Simple Random Sampler of Sensitive Attributes. (arXiv:2311.05866v1 [stat.ML])
    As the data-driven decision process becomes dominating for industrial applications, fairness-aware machine learning arouses great attention in various areas. This work proposes fairness penalties learned by neural networks with a simple random sampler of sensitive attributes for non-discriminatory supervised learning. In contrast to many existing works that critically rely on the discreteness of sensitive attributes and response variables, the proposed penalty is able to handle versatile formats of the sensitive attributes, so it is more extensively applicable in practice than many existing algorithms. This penalty enables us to build a computationally efficient group-level in-processing fairness-aware training framework. Empirical evidence shows that our framework enjoys better utility and fairness measures on popular benchmark data sets than competing methods. We also theoretically characterize estimation errors and loss of utility of the proposed neural-penalized risk minimization problem.  ( 2 min )
    Deep quantum neural networks form Gaussian processes. (arXiv:2305.09957v2 [quant-ph] UPDATED)
    It is well known that artificial neural networks initialized from independent and identically distributed priors converge to Gaussian processes in the limit of large number of neurons per hidden layer. In this work we prove an analogous result for Quantum Neural Networks (QNNs). Namely, we show that the outputs of certain models based on Haar random unitary or orthogonal deep QNNs converge to Gaussian processes in the limit of large Hilbert space dimension $d$. The derivation of this result is more nuanced than in the classical case due to the role played by the input states, the measurement observable, and the fact that the entries of unitary matrices are not independent. An important consequence of our analysis is that the ensuing Gaussian processes cannot be used to efficiently predict the outputs of the QNN via Bayesian statistics. Furthermore, our theorems imply that the concentration of measure phenomenon in Haar random QNNs is worse than previously thought, as we prove that expectation values and gradients concentrate as $\mathcal{O}\left(\frac{1}{e^d \sqrt{d}}\right)$. Finally, we discuss how our results improve our understanding of concentration in $t$-designs.  ( 2 min )
    High-dimensional mixed-categorical Gaussian processes with application to multidisciplinary design optimization for a green aircraft. (arXiv:2311.06130v1 [math.OC])
    Multidisciplinary design optimization (MDO) methods aim at adapting numerical optimization techniques to the design of engineering systems involving multiple disciplines. In this context, a large number of mixed continuous, integer, and categorical variables might arise during the optimization process, and practical applications involve a significant number of design variables. Recently, there has been a growing interest in mixed-categorical metamodels based on Gaussian Process (GP) for Bayesian optimization. In particular, to handle mixed-categorical variables, several existing approaches employ different strategies to build the GP. These strategies either use continuous kernels, such as the continuous relaxation or the Gower distance-based kernels, or direct estimation of the correlation matrix, such as the exponential homoscedastic hypersphere (EHH) or the Homoscedastic Hypersphere (HH) kernel. Although the EHH and HH kernels are shown to be very efficient and lead to accurate GPs, they are based on a large number of hyperparameters. In this paper, we address this issue by constructing mixed-categorical GPs with fewer hyperparameters using Partial Least Squares (PLS) regression. Our goal is to generalize Kriging with PLS, commonly used for continuous inputs, to handle mixed-categorical inputs. The proposed method is implemented in the open-source software SMT and has been efficiently applied to structural and multidisciplinary applications. Our method is used to effectively demonstrate the structural behavior of a cantilever beam and facilitates MDO of a green aircraft, resulting in a 439-kilogram reduction in the amount of fuel consumed during a single aircraft mission.  ( 3 min )
    Statistical Guarantees for Variational Autoencoders using PAC-Bayesian Theory. (arXiv:2310.04935v2 [cs.LG] UPDATED)
    Since their inception, Variational Autoencoders (VAEs) have become central in machine learning. Despite their widespread use, numerous questions regarding their theoretical properties remain open. Using PAC-Bayesian theory, this work develops statistical guarantees for VAEs. First, we derive the first PAC-Bayesian bound for posterior distributions conditioned on individual samples from the data-generating distribution. Then, we utilize this result to develop generalization guarantees for the VAE's reconstruction loss, as well as upper bounds on the distance between the input and the regenerated distributions. More importantly, we provide upper bounds on the Wasserstein distance between the input distribution and the distribution defined by the VAE's generative model.  ( 2 min )
    Dataset Distillation with Convexified Implicit Gradients. (arXiv:2302.06755v2 [cs.LG] UPDATED)
    We propose a new dataset distillation algorithm using reparameterization and convexification of implicit gradients (RCIG), that substantially improves the state-of-the-art. To this end, we first formulate dataset distillation as a bi-level optimization problem. Then, we show how implicit gradients can be effectively used to compute meta-gradient updates. We further equip the algorithm with a convexified approximation that corresponds to learning on top of a frozen finite-width neural tangent kernel. Finally, we improve bias in implicit gradients by parameterizing the neural network to enable analytical computation of final-layer parameters given the body parameters. RCIG establishes the new state-of-the-art on a diverse series of dataset distillation tasks. Notably, with one image per class, on resized ImageNet, RCIG sees on average a 108\% improvement over the previous state-of-the-art distillation algorithm. Similarly, we observed a 66\% gain over SOTA on Tiny-ImageNet and 37\% on CIFAR-100.  ( 2 min )
    Orthogonal Polynomials Approximation Algorithm (OPAA):a functional analytic approach to estimating probability densities. (arXiv:2211.08594v2 [cs.LG] UPDATED)
    We present the new Orthogonal Polynomials Approximation Algorithm (OPAA), a parallelizable algorithm that solves two problems from a functional analytic approach: first, it finds a smooth functional estimate of a density function, whether it is normalized or not; second, the algorithm provides an estimate of the normalizing weight. In the context of Bayesian inference, OPAA provides an estimate of the posterior function as well as the normalizing weight, which is also known as the evidence. A core component of OPAA is a special transform of the square root of the joint distribution into a special functional space of our construct. Through this transform, the evidence is equated with the $L^2$ norm of the transformed function, squared. Hence, the evidence can be estimated by the sum of squares of the transform coefficients. The computations can be parallelized and completed in one pass. To compute the transform coefficients, OPAA proposes a new computational scheme leveraging Gauss--Hermite quadrature in higher dimensions. Not only does it avoid the potential high variance problem associated with random sampling methods, it also enables one to speed up the computation by parallelization, and significantly reduces the complexity by a vector decomposition.  ( 2 min )
    Greedy PIG: Adaptive Integrated Gradients. (arXiv:2311.06192v1 [cs.LG])
    Deep learning has become the standard approach for most machine learning tasks. While its impact is undeniable, interpreting the predictions of deep learning models from a human perspective remains a challenge. In contrast to model training, model interpretability is harder to quantify and pose as an explicit optimization problem. Inspired by the AUC softmax information curve (AUC SIC) metric for evaluating feature attribution methods, we propose a unified discrete optimization framework for feature attribution and feature selection based on subset selection. This leads to a natural adaptive generalization of the path integrated gradients (PIG) method for feature attribution, which we call Greedy PIG. We demonstrate the success of Greedy PIG on a wide variety of tasks, including image feature attribution, graph compression/explanation, and post-hoc feature selection on tabular data. Our results show that introducing adaptivity is a powerful and versatile method for making attribution methods more powerful.  ( 2 min )
    A Semi-Bayesian Nonparametric Estimator of the Maximum Mean Discrepancy Measure: Applications in Goodness-of-Fit Testing and Generative Adversarial Networks. (arXiv:2303.02637v2 [stat.ML] UPDATED)
    A classic inferential statistical problem is the goodness-of-fit (GOF) test. Such a test can be challenging when the hypothesized parametric model has an intractable likelihood and its distributional form is not available. Bayesian methods for GOF can be appealing due to their ability to incorporate expert knowledge through prior distributions. However, standard Bayesian methods for this test often require strong distributional assumptions on the data and their relevant parameters. To address this issue, we propose a semi-Bayesian nonparametric (semi-BNP) procedure in the context of the maximum mean discrepancy (MMD) measure that can be applied to the GOF test. Our method introduces a novel Bayesian estimator for the MMD, enabling the development of a measure-based hypothesis test for intractable models. Through extensive experiments, we demonstrate that our proposed test outperforms frequentist MMD-based methods by achieving a lower false rejection and acceptance rate of the null hypothesis. Furthermore, we showcase the versatility of our approach by embedding the proposed estimator within a generative adversarial network (GAN) framework. It facilitates a robust BNP learning approach as another significant application of our method. With our BNP procedure, this new GAN approach can enhance sample diversity and improve inferential accuracy compared to traditional techniques.  ( 3 min )
    Anytime-Valid Confidence Sequences for Consistent Uncertainty Estimation in Early-Exit Neural Networks. (arXiv:2311.05931v1 [cs.LG])
    Early-exit neural networks (EENNs) facilitate adaptive inference by producing predictions at multiple stages of the forward pass. In safety-critical applications, these predictions are only meaningful when complemented with reliable uncertainty estimates. Yet, due to their sequential structure, an EENN's uncertainty estimates should also be consistent: labels that are deemed improbable at one exit should not reappear within the confidence interval / set of later exits. We show that standard uncertainty quantification techniques, like Bayesian methods or conformal prediction, can lead to inconsistency across exits. We address this problem by applying anytime-valid confidence sequences (AVCSs) to the exits of EENNs. By design, AVCSs maintain consistency across exits. We examine the theoretical and practical challenges of applying AVCSs to EENNs and empirically validate our approach on both regression and classification tasks.  ( 2 min )
    An empirical evaluation of imbalanced data strategies from a practitioner's point of view. (arXiv:1810.07168v2 [cs.LG] UPDATED)
    This paper evaluates six strategies for mitigating imbalanced data: oversampling, undersampling, ensemble methods, specialized algorithms, class weight adjustments, and a no-mitigation approach referred to as the baseline. These strategies were tested on 58 real-life binary imbalanced datasets with imbalance rates ranging from 3 to 120. We conducted a comparative analysis of 10 under-sampling algorithms, 5 over-sampling algorithms, 2 ensemble methods, and 3 specialized algorithms across eight different performance metrics: accuracy, area under the ROC curve (AUC), balanced accuracy, F1-measure, G-mean, Matthew's correlation coefficient, precision, and recall. Additionally, we assessed the six strategies on altered datasets, derived from real-life data, with both low (3) and high (100 or 300) imbalance ratios (IR). The principal finding indicates that the effectiveness of each strategy significantly varies depending on the metric used. The paper also examines a selection of newer algorithms within the categories of specialized algorithms, oversampling, and ensemble methods. The findings suggest that the current hierarchy of best-performing strategies for each metric is unlikely to change with the introduction of newer algorithms.  ( 2 min )
    EControl: Fast Distributed Optimization with Compression and Error Control. (arXiv:2311.05645v1 [math.OC])
    Modern distributed training relies heavily on communication compression to reduce the communication overhead. In this work, we study algorithms employing a popular class of contractive compressors in order to reduce communication overhead. However, the naive implementation often leads to unstable convergence or even exponential divergence due to the compression bias. Error Compensation (EC) is an extremely popular mechanism to mitigate the aforementioned issues during the training of models enhanced by contractive compression operators. Compared to the effectiveness of EC in the data homogeneous regime, the understanding of the practicality and theoretical foundations of EC in the data heterogeneous regime is limited. Existing convergence analyses typically rely on strong assumptions such as bounded gradients, bounded data heterogeneity, or large batch accesses, which are often infeasible in modern machine learning applications. We resolve the majority of current issues by proposing EControl, a novel mechanism that can regulate error compensation by controlling the strength of the feedback signal. We prove fast convergence for EControl in standard strongly convex, general convex, and nonconvex settings without any additional assumptions on the problem or data heterogeneity. We conduct extensive numerical evaluations to illustrate the efficacy of our method and support our theoretical findings.  ( 2 min )
    Step and Smooth Decompositions as Topological Clustering. (arXiv:2311.05756v1 [math.ST])
    We investigate a class of recovery problems for which observations are a noisy combination of continuous and step functions. These problems can be seen as non-injective instances of non-linear ICA with direct applications to image decontamination for magnetic resonance imaging. Alternately, the problem can be viewed as clustering in the presence of structured (smooth) contaminant. We show that a global topological property (graph connectivity) interacts with a local property (the degree of smoothness of the continuous component) to determine conditions under which the components are identifiable. Additionally, a practical estimation algorithm is provided for the case when the contaminant lies in a reproducing kernel Hilbert space of continuous functions. Algorithm effectiveness is demonstrated through a series of simulations and real-world studies.  ( 2 min )
    Optimal simulation-based Bayesian decisions. (arXiv:2311.05742v1 [stat.ML])
    We present a framework for the efficient computation of optimal Bayesian decisions under intractable likelihoods, by learning a surrogate model for the expected utility (or its distribution) as a function of the action and data spaces. We leverage recent advances in simulation-based inference and Bayesian optimization to develop active learning schemes to choose where in parameter and action spaces to simulate. This allows us to learn the optimal action in as few simulations as possible. The resulting framework is extremely simulation efficient, typically requiring fewer model calls than the associated posterior inference task alone, and a factor of $100-1000$ more efficient than Monte-Carlo based methods. Our framework opens up new capabilities for performing Bayesian decision making, particularly in the previously challenging regime where likelihoods are intractable, and simulations expensive.  ( 2 min )
    Structured Transforms Across Spaces with Cost-Regularized Optimal Transport. (arXiv:2311.05788v1 [cs.LG])
    Matching a source to a target probability measure is often solved by instantiating a linear optimal transport (OT) problem, parameterized by a ground cost function that quantifies discrepancy between points. When these measures live in the same metric space, the ground cost often defaults to its distance. When instantiated across two different spaces, however, choosing that cost in the absence of aligned data is a conundrum. As a result, practitioners often resort to solving instead a quadratic Gromow-Wasserstein (GW) problem. We exploit in this work a parallel between GW and cost-regularized OT, the regularized minimization of a linear OT objective parameterized by a ground cost. We use this cost-regularized formulation to match measures across two different Euclidean spaces, where the cost is evaluated between transformed source points and target points. We show that several quadratic OT problems fall in this category, and consider enforcing structure in linear transform (e.g. sparsity), by introducing structure-inducing regularizers. We provide a proximal algorithm to extract such transforms from unaligned data, and demonstrate its applicability to single-cell spatial transcriptomics/multiomics matching tasks.  ( 2 min )
    Differentiable VQ-VAE's for Robust White Matter Streamline Encodings. (arXiv:2311.06212v1 [stat.ML])
    Given the complex geometry of white matter streamlines, Autoencoders have been proposed as a dimension-reduction tool to simplify the analysis streamlines in a low-dimensional latent spaces. However, despite these recent successes, the majority of encoder architectures only perform dimension reduction on single streamlines as opposed to a full bundle of streamlines. This is a severe limitation of the encoder architecture that completely disregards the global geometric structure of streamlines at the expense of individual fibers. Moreover, the latent space may not be well structured which leads to doubt into their interpretability. In this paper we propose a novel Differentiable Vector Quantized Variational Autoencoder, which are engineered to ingest entire bundles of streamlines as single data-point and provides reliable trustworthy encodings that can then be later used to analyze streamlines in the latent space. Comparisons with several state of the art Autoencoders demonstrate superior performance in both encoding and synthesis.  ( 2 min )
    Improvements on Uncertainty Quantification for Node Classification via Distance-Based Regularization. (arXiv:2311.05795v1 [cs.LG])
    Deep neural networks have achieved significant success in the last decades, but they are not well-calibrated and often produce unreliable predictions. A large number of literature relies on uncertainty quantification to evaluate the reliability of a learning model, which is particularly important for applications of out-of-distribution (OOD) detection and misclassification detection. We are interested in uncertainty quantification for interdependent node-level classification. We start our analysis based on graph posterior networks (GPNs) that optimize the uncertainty cross-entropy (UCE)-based loss function. We describe the theoretical limitations of the widely-used UCE loss. To alleviate the identified drawbacks, we propose a distance-based regularization that encourages clustered OOD nodes to remain clustered in the latent space. We conduct extensive comparison experiments on eight standard datasets and demonstrate that the proposed regularization outperforms the state-of-the-art in both OOD detection and misclassification detection.  ( 2 min )

  • Open

    [Project] Which text representation technique to choose?
    I have a project to do on a movie review classifier, I'm leaning towards the "bag of words" technique, thoughts? submitted by /u/The_Middle_Child_ [link] [comments]  ( 9 min )
    [R] [ICLR] Is it okay to reference an answer to another reviewer in a reviewer's response?
    I am writing the responses to my reviewers for the ICLR. I had two reviewers asking a very similar question, and I'm wondering whether it is good to say to one of them something like "please refer to answer (4) in my response to reviewer xxxx" and then continue to tailor my answer to this reviewer, based on the reasoning offered in (4) for reviewer xxxx. Is this valid or should I just copy/paste the answer? submitted by /u/howtorewriteaname [link] [comments]  ( 9 min )
    [D] How are Statistics Journals Viewed in Machine Learning Community for Research Jobs?
    Hi all, I'm a statistics PhD researching statistical machine learning (think Gaussian Processes). I have a few papers in the pipeline at solid statistics publications (Journal of the American Statistical Association/Journal of Computational and Graphical Statistics). Do these journals have any weight/name recognition in the ML world or are they unknown/chopped liver? Thanks! submitted by /u/herodotick [link] [comments]  ( 9 min )
    [R] JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models
    Paper: https://arxiv.org/abs/2311.05997 Blog: https://craftjarvis-jarvis1.github.io/ Abstract: Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. In our experiments, JARVIS-1 exhibits nearly perfect performances across over 200 varying tasks from the Minecraft Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 has achieved a completion rate of 12.5% in the long-horizon diamond pickaxe task. This represents a significant increase up to 5 times compared to previous records. Furthermore, we show that JARVIS-1 is able to self-improve following a life-long learning paradigm thanks to multimodal memory, sparking a more general intelligence and improved autonomy. submitted by /u/MysteryInc152 [link] [comments]  ( 9 min )
    [D] Tutorials on Generative Adversarial Networks ?
    I know there are plenty of tutorials online, but most of them just skim through the topic and don't have much in detail, some of them even contradict each other. I have read the original paper, but very technical and difficult to understand as a newbie. Are there any tutorials that explain GANs in details but are still newbie-friendly? Thanks submitted by /u/Electronic_Ant2706 [link] [comments]  ( 9 min )
    [D] Course Enrollment
    Hello, This is my first time posting so here goes nothing. I want to train a model to recommend to university students which courses to enroll in for their next semester, I have all the data I need but I don’t know how to structure it to start, I have 3 different data sheets containing what the students have enrolled in with their grades for each semester(they are all studying the same major), their high school finals grades ,and the last sheet is basically the study plan showing each course and what are it’s prerequisites and like how many credit hours it counts for. My question is how do I recommend the students which courses to recommend if they give me an input of how many credit hours they want to enroll in and also have it be personalized, any information would be helpful. submitted by /u/Between5and20 [link] [comments]  ( 9 min )
    [R] Simplifying Transformer Blocks
    Paper: https://arxiv.org/abs/2311.01906 GitHub: https://github.com/bobby-he/simplified_transformers Abstract: A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections & normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable. In this work, we ask to what extent the standard transformer block can be simplified? Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput, and using 15% fewer parameters. https://preview.redd.it/6pz53ro6260c1.png?width=1129&format=png&auto=webp&s=ce9dad8de6cac575970e38d93a35786fe8880506 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [P] Fraud Detection | e-commerce
    I am building a system to separate fraudulent transactions, these will be then manually verified, helping me in turn build a labelled dataset over time. For now I have transaction data and customer behavior Information. I intend to do this like this:The possible fraud cases include:- Abuse of all cashbacks and discounts ( Coupons / Vouchers / Auto Refund)- Retailing - Acquiring sensitive SKUs Since I don't really have labelled data, I am going with the unsupervised learning approach (isolation forest).I plan on having 3 modules : Users, SKUs, Localities For the last 2 I am suffering with setting meaningful thresholds, I standardized slope of sales trend and intercept, then divided them to get a compound variable which I am using to compare. Please share thoughts and or Resources. submitted by /u/common_happen143 [link] [comments]  ( 9 min )
    [D] Data visualization tool as local server
    Hello community! I am reading some of the mechanistic interpretability papers and I am pretty interested in this field and wants to get my hand on. This field, possibly with many other fields that involve explanation/interpretability, requires a ton of visualization, of weights, of activation, of features etc. Thus I think a more tailored tool than plain plotly/matplotlib/seaborn is needed. What I am looking for is a tool that can deploy as a local server so that I can seamlessly navigate through a LOT of visualizations, ideally also able to interact like plotly. I can for example save a ~100MB packed image via matplotlib, but opening and viewing such a giant image is a pain and I lose interactivity. One example is openai's microsocpe, where you can view each weight of various models, as well as feature visualization of the weight. But of course for my own purpose it need not to be that fancy. As someone who does not know a lot on data visualization beyond matplotlib, I really appreciate any pointers! Thanks! submitted by /u/tt19234 [link] [comments]  ( 9 min )
    Deployment of RVC pipeline to Sagemaker Serverless Endpoint [P]
    Really struggling to make this work. Does anyone have any insights? Would happily pay good money for some guidance. submitted by /u/torstah [link] [comments]  ( 9 min )
    [Research] Seeking Structured Strategies for Neural Network Design
    Hi everyone, I'm currently a PhD student diving into how transformer models can be applied to biological sequence data. My task involves iterating on a deep-learning base model to improve its accuracy. However, I seem stuck in a trial-and-error loop without a clear-cut strategy. I'm reaching out to see if there's a more systematic approach to tweaking neural network architectures that are commonly accepted in the field. Are there broad-stroke principles or frameworks that you rely on, regardless of the project specifics? Any pointers on more structured methods or resources for guiding these decisions would be super helpful. submitted by /u/palset [link] [comments]  ( 9 min )
    [D] Is there a good tool for inspecting a video dataset?
    Hi I'm creating a video dataset composed of small clips and i would like to use a tool for inspecting videos, that is, something in which i can easily play them and delete them directly or send them to another folder. Any ideas or suggestions? submitted by /u/aendrs [link] [comments]  ( 9 min )
    [D] Should Tensorflow library builds/contributions be this cumbersome?
    I am working with Edge AI, and have work projects that are using TFLM to run networks on microcontrollers. I have found that for some of the optimizations, such as operator fusion of custom activations, I need to adjust the Tensorflow library logic such as MLIR/TFL-graph-builder in order to export a .tflite execution graph that can be interpreted by the microcontroller library(TFLM/CMSIS/X-CUBE-AI/TinyEngine/TVM). This requires me to recompile the tensorflow repo locally using Bazel after every addition to the library. This takes from 20 min to 5 hours depending on what is cached. Now sure, the whole thing does not have to be compiled from scratch every time, but this feels like such a cumbersome and slow development process. Is this the reality of it, or am I missing something? submitted by /u/SlayahhEUW [link] [comments]  ( 9 min )
    [D] Don't know what to do
    Hello, I'm a junior developer at a small company. I'm in a tough situation and looking for opinions. The company hired me for an ML/data-oriented position, basically a backend developer with ML/data focus. Currently, I'm tasked with a full computer vision project, which means that I have to do almost everything on my own: object detection, analysis, database storage, data extraction, etc., all in real-time. I received a subpar laptop, making testing and regular work difficult. Despite limited support from senior developers, the project works, but not flawlessly. How can they expect me to build this? I've asked for help, but the response is delayed, and there's no mentoring. Recently, I expressed my struggle, and they questioned why it's not finished. I fear they might let me go. Some senior friends of mine advised me to leave. I'm unsure; any advice? Also, my salary is 50k which as I know, rather low. submitted by /u/windonwind [link] [comments]  ( 9 min )
    [R] A Practical Approach to Novel Class Discovery in Tabular Data
    Detecting new, unlabeled data classes is a big opportunity in AI, especially for models trained on known, labeled data. This challenge is crucial in a world where data is constantly evolving. A recent paper takes a novel approach to this. They use known class labels to help cluster unlabeled data into new categories. They test 3 methods - extending k-means and spectral clustering to leverage known data, and a custom projection-based network. The projection network avoids overfitting and works best overall, even when the number of novel classes is uncertain. It outperforms standard techniques and a more complex benchmark model called TabularNCD. This research shows that novel class discovery in tabular data is feasible without unrealistic access to novel class information during training. While there are limitations, it's progress in an important open problem. Applications could range from identifying new categories of census data and insurance claims to customer segments over time. The techniques might even extend to time series or graph data. Overall, this paper moves us closer to more flexible, general AI that can adapt and learn in an open-ended, lifelong manner. TLDR: AI can discover hidden relationships in tabular data Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    ML course on Udemy - Opinion [D]
    Hi everyone, I am privately very interested in this field and have a very basic understanding of how it kind of works, but now I really want to make the move and actually learn this. I have found a bunch of courses on Udemy and bought a course on black Friday called "Complete Machine Learning and Data Science Bootcamp 2023" by Andrei Neagoie How legit or useful is this course? It's a hands-on boot camp with practical learning of Python basics and understanding its depth. It has a total of 44 hours, and I really like it so far but also want some guidance on how to get into this whole field. Any recommendation is very welcome, thanks! :) submitted by /u/TheOceanIAm [link] [comments]  ( 9 min )
    [D] Machine Learning/AI companies that sponsor or fund non-profits
    Hello, I work at a non-profit that is aimed for machine learning and/or related grad students in developing countries. We are based in North América and currently looking for sponsorships, preferably from a company with the same goals in machine learning and artificial intelligence that we have. Do you know of any companies with a CSR (corporate social responsibility) program? Thank you! (I'm not mentioning the non-profit as I'm worried the mods will take it down as self-promotion but if anyone wants me to share please let me know!) submitted by /u/glitteryCranberry [link] [comments]  ( 9 min )
    [P] Whisper large-v3 API
    I'm working on making it easier to use open-source AI models by providing simple APIs. Last week, OpenAI released Whisper v3, which showed improved performance across all its 100+ supported languages. We tried to optimize it as much as possible to be able to offer it for $0.0028 per minute (vs $0.006 on OpenAI). I hope it's helpful for someone: https://www.lemonfox.ai/apis/speech-to-text submitted by /u/ingojoseph [link] [comments]  ( 9 min )
    Data Scientist to Machine Learning Engineer - Do I Need a Portfolio [D]
    With 7+ years of experience in data science and data analytics roles, would it be necessary to create a portfolio to begin applying to Machine Learning Engineer positions? If I already have the experience using sklearn/tf/keras/etc. in my current role of >2 years, how important is a portfolio when applying to these MLE roles? submitted by /u/Fiboniz [link] [comments]  ( 9 min )
    [P] Need advice on building a YOLOv8 + SAHI program
    Need advice on building and object detection program. I plan to use YOLOv8 with SAHI (slicing aided hyper inference) on RTSP stream of an IP camera. The detection stream has to be saved in realtime. I have a Jetson Orin AGX 64gb to utilize the NVDEC (HW engine) to decode the h.265 video. What is the easiest possible way to achieve this. Thank you very much for your advice. submitted by /u/These_Consequence_52 [link] [comments]  ( 9 min )
    [D] Gen-AI/LLM - Interview prep
    Hello folks, I have an interview call later this week which the work is regarding implementing generative AI within the companies workflow. Using LLMs with finetuning/in-context learning using system logs etc kind of stuff. I have studied machine learning, worked for few years now as well. Have good understanding of those stuff but never tried fine tuning hands-on. I'm worked majority into computer-vision applications but think that I lagged a bit on the LLM side. Any suggestions, recommeded papers, courses, videos I could go through? Thanks! submitted by /u/ade17_in [link] [comments]  ( 9 min )
    [D] Logit Laplace Reconstruction Loss
    Was reading through the original DALL-E paper (https://arxiv.org/pdf/2102.12092.pdf) and found their appendix A.3 pretty interesting. The motivation behind the logit Laplace loss totally makes sense, but if you plot the negative log of the PDF of their logit Laplace function, it's not strictly nonnegative as a function of x. Does that mean the authors are allowing for a potentially negative loss with this reformulation? submitted by /u/clywac2 [link] [comments]  ( 9 min )
    [R] P-MMF: Provider Max-min Fairness Re-ranking in Recommender System
    In this paper, we address the issue of recommending fairly from the aspect of providers, which has become increasingly essential in multistakeholder recommender systems. Existing studies on provider fairness usually focused on designing proportion fairness (PF) metrics that first consider systematic fairness. However, sociological researches show that to make the market more stable, max-min fairness (MMF) is a better metric. The main reason is that MMF aims to improve the utility of the worst ones preferentially, guiding the system to support the providers in weak market positions. When applying MMF to recommender systems, how to balance user preferences and provider fairness in an online recommendation scenario is still a problem. The central question revolves around the type of fairness that holds significance within recommender systems and other machine-learning systems. Experimental results also show that P-MMF can retain small computational costs on a corpus with the large number of items. This paper was accepted by TheWebConf 2023 and awarded as SpotLight-Best Paper Candidate (https://dl.acm.org/doi/abs/10.1145/3543507.3583296). The central question revolves around the type of fairness that holds significance among different stakeholders in recommender systems and other machine-learning systems. ​ ​ Two types of fairness ​ Harm for unfairness ​ MMF and PF ​ P-MMF can pareto-dominate baselines ​ ​ submitted by /u/XuChen0427 [link] [comments]  ( 9 min )
    [R] Data Filtering Networks
    Paper: https://arxiv.org/abs/2309.17425 Models: https://github.com/mlfoundations/open_clip Results table: https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv X thread: https://twitter.com/gabriel_ilharco/status/1721905861369762096 Abstract: Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art CLIP models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 84.4% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data. ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
  • Open

    Live AI Course (Sailea Non-profit) (Forever Free Course)
    >Sailea is a student run non-profit that does not charge for any of its services Join the Second lesson of SAILea’s course on the Principals of AI! 🌳 Covers: Decision Trees 🗓️ November 18th ⏰ 7:00-8:00PM EST Why Sailea? Only course targeted at high schoolers Free Forever Join Us Now! 👉 (signup form) https://docs.google.com/forms/d/e/1FAIpQLSfQGCeZClTdF6zeIQ-RtbOGR582bb1slc3oR0zG2J7j1v5RHg/viewform?usp=sf_link 🌳 Register today, get involved in the community and grow your knowledge! submitted by /u/Envoy-Insc [link] [comments]  ( 9 min )
    OpenAI New Products: A Deep Dive
    submitted by /u/Overflame [link] [comments]  ( 1 min )
    🪷Chinese Perspectives on OpenAI DevDay, NVIDIA‘s New AI Chips for China, and Zen Buddhism Meets AI
    submitted by /u/trcytony [link] [comments]  ( 1 min )
    Harry Potter 80s sitcom (deepfake)
    submitted by /u/vaddymusic [link] [comments]  ( 1 min )
    AI Is a Terrifying Purveyor of Bullshit. Next Up: Fake Science
    AI tools like chatGPT and Google's AI tool are dangerous as they can generate misinformation and disinformation, making it difficult to determine what is true. These tools can invent thoughts and create fake stories that seem plausible. AI models like ChatGPT and GPT-4 ADA can generate persuasive but false statements and even create fake datasets to support preordained conclusions. This poses a threat to scientific research and the ability to distinguish between real and fake information. The proliferation of AI-generated content makes it increasingly challenging to parse authentic information from AI-generated bullshit. Source : https://www.lastwordonnothing.com/2023/11/10/ai-is-a-terrifying-purveyor-of-bullshit-next-up-fake-science/ submitted by /u/NuseAI [link] [comments]
    Ai that automatically finds potentially interesting parts of video
    I've seen somewhere in tiktok or yt shorts a guy talking about an ai website in which you paste a yt link and ai automatically finding potentially interesting parts of video which you then can split and make short tiktok from it. Is it possible to find this website? Tried searching on yt, ig, tiktok submitted by /u/dusy4 [link] [comments]  ( 9 min )
    AI in Medicine: how should it be regulated? What should states do?
    Do you have any policy proposals you think would help ai in medicine (aim) to be safer, less biased, better in any way? What would you do if you could impact state level policy on aim? submitted by /u/mbfunke [link] [comments]  ( 9 min )
    AI Headshots
    I am looking for a free ai professional headshot for my LinkedIn page. I just need one picture and I don’t want to spend any money because. I’m low on funds. Any advice would be great! Thanks submitted by /u/tyrannosaurwes [link] [comments]  ( 1 min )
    Will user agents help with integrating all niche radiology tools to form a more copilot experience?
    https://python.langchain.com/docs/modules/agents/ There are a lot of niche AI programs that outperform radiologists in very specific tasks. But no one program that integrates them all. I wonder if a user agent can help with this since the reasoning engine of a LLM can determine when to use what tool. submitted by /u/derpgod123 [link] [comments]  ( 9 min )
    Looking for paper: generalization of LLM as probabilistic generation of tokens + validation
    I can't find the paper but it's your of important to me. It had the following content; * theoretical considerations of viewing LLM as way to probabilistic generate tokens * data flow of tokens was also considered, realized with nodes which can generate tokens or validate them * generalization of various "agentic" papers which did validation of result with LLM as data flow nodes to validate I can remember two papers which did this sort of thing. I can't find neither of them. I am looking for the one paper which really did use nodes. submitted by /u/squareOfTwo [link] [comments]  ( 9 min )
    Pokemon Mecha
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    Will Grok overrun chatGPT?
    We all saw Grok and its okayish. Do you think it'll get considerably better taking into account elon musk's past exploits? submitted by /u/redblade678 [link] [comments]
    One-Minute Daily AI News 11/12/2023
    Nvidia CEO Jensen Huang says his AI powerhouse is ‘always in peril’ despite a $1.1 trillion market cap: ‘We feel it’.[1] OpenAI Announces New Data Partnerships to Enhance AI Training.[2] AI Could Make Grand Theft Auto NPCs ‘Really Interesting and Fun’, Says Take-Two CEO.[3] In a significant industry shift, Baidu Inc. reportedly ordered AI chips from Huawei moving away from long-time supplier Nvidia.[4] Sources: [1] https://finance.yahoo.com/news/nvidia-ceo-jensen-huang-says-225059737.html [2] https://innovation-village.com/openai-announces-new-data-partnerships-to-enhance-ai-training/ [3] https://www.pushsquare.com/news/2023/11/ai-could-make-grand-theft-auto-npcs-really-interesting-and-fun-says-take-two-ceo [4] https://www.benzinga.com/news/23/11/35636991/blow-for-nvidia-baidu-reportedly-switches-to-huawei-for-ai-chips-ditching-its-long-time-supplier submitted by /u/Excellent-Target-847 [link] [comments]
  • Open

    Need help training a PPO NN to learn how to play my deckbuilding game
    (Posted originally in r/artificial, but didn't get any responses there; so I'm trying here, which seems much more relevant :) I have a roguelike deckbuilding game I want to train an agent to play using pure unsupervised RL; I chose PPO as I understand (to my amateur knowledge) that is the most fitting algorithm. I have a sizeable categorical space to send in (basically, what cards are in the deck and which cards are being offered to pick), and I need the agent to learn the best picks. I attempted to use an embedding layer and input the cards the player has + given cards + numerical data (added to the embedding output). I tried playing around with various hyperparameters, but so far, I have not been able to generate any learning. (I posted originally in d be greatly, but I didn't get any responses there, so I'm trying here, which seems much more relevant :) submitted by /u/Jagerjj [link] [comments]  ( 1 min )
    How to stabilize Pybullet environment
    Hello everyone I'm quite new to pybullet, and working with a URDF file i found only to control the hexapods shoulders and biceps, however the simulator is acting really funky. The bot is flying up a lot into the air, skittering on the ground. I wanted to reach out here and see if anyone knows the most important parameters to tinker with to fix this Here is a video of what im talking about https://youtu.be/DrOKz9xs2I8 any tips and tricks would be greatly appreciated submitted by /u/bbri826 [link] [comments]  ( 9 min )
    PPO agent not learning
    Do have a go at the problem. ​ I have a custom Boid flocking environment in OpenAI Gym using PPO from StableBaselines3. I wanted it to achieve flocking similar to Reynold's model(Video) or close enough, but it isn't learning. ​ I have adjusted the calculate_reward my model uses to be similar but not seeing any apparent improvement. ​ Reynold's model equations: Reynold's Model ​ My results after 100000 timesteps of training: My result so far: https://drive.google.com/file/d/1jAlGrGmpt2nUspBtoZcN7yJLHFQe4CAy/view?usp=drive_link ​ TensorBoard Graphs TensorBoard Reward Function def calculate_reward(self): total_reward = 0 cohesion_reward = 0 separation_reward = 0 collision_penalty = 0 velocity_matching_reward = 0 for agent in self.agents: for other in self.agents: if agent != other: distance = np.linalg.norm(agent.position - other.position) # if distance <= 50: # cohesion_reward += 5 if distance < SimulationVariables["NeighborhoodRadius"]: separation_reward -= 100 velocity_matching_reward += np.linalg.norm(np.mean([other.velocity for other in self.agents], axis=0) - agent.velocity) if distance < SimulationVariables["SafetyRadius"]: collision_penalty -= 1000 total_reward = separation_reward + velocity_matching_reward + collision_penalty # print(f"Total: {total_reward}, Cohesion: {cohesion_reward}, Separation: {separation_reward}, Velocity Matching: {velocity_matching_reward}, Collision: {collision_penalty}") return total_reward, cohesion_reward, separation_reward, collision_penalty Complete code: Code P.S ANY help is appreciated, I have tried different approaches but the level of desperation is increasing lol. submitted by /u/Sadboi1010 [link] [comments]  ( 9 min )
  • Open

    New Class of Accelerated, Efficient AI Systems Mark the Next Era of Supercomputing
    NVIDIA today unveiled at SC23 the next wave of technologies that will lift scientific and industrial research centers worldwide to new levels of performance and energy efficiency. “NVIDIA hardware and software innovations are creating a new class of AI supercomputers,” said Ian Buck, vice president of the company’s high performance computing and hyperscale data center Read article >  ( 9 min )
    Gen AI for the Genome: LLM Predicts Characteristics of COVID Variants
    A widely acclaimed large language model for genomic data has demonstrated its ability to generate gene sequences that closely resemble real-world variants of SARS-CoV-2, the virus behind COVID-19. Called GenSLMs, the model, which last year won the Gordon Bell special prize for high performance computing-based COVID-19 research, was trained on a dataset of nucleotide sequences Read article >  ( 6 min )
    Researchers Poised for Advances With NVIDIA CUDA Quantum
    Michael Kuehn and Davide Vodola are taking to new heights work that’s pioneering quantum computing for the world’s largest chemical company. The BASF researchers are demonstrating how a quantum algorithm can see what no traditional simulation can — key attributes of NTA, a compound with applications that include removing toxic metals like iron from a Read article >  ( 6 min )
    NVIDIA Grace Hopper Superchip Powers 40+ AI Supercomputers Across Global Research Centers, System Makers, Cloud Providers
    Dozens of new supercomputers for scientific computing will soon hop online, powered by NVIDIA’s breakthrough GH200 Grace Hopper Superchip for giant-scale AI and high performance computing. The NVIDIA GH200 enables scientists and researchers to tackle the world’s most challenging problems by accelerating complex AI and HPC applications running terabytes of data. At the SC23 supercomputing Read article >  ( 6 min )
  • Open

    Implement real-time personalized recommendations using Amazon Personalize
    At a basic level, Machine Learning (ML) technology learns from data to make predictions. Businesses use their data with an ML-powered personalization service to elevate their customer experience. This approach allows businesses to use data to derive actionable insights and help grow their revenue and brand loyalty. Amazon Personalize accelerates your digital transformation with ML, […]  ( 8 min )
    Improve LLM responses in RAG use cases by interacting with the user
    One of the most common applications of generative AI and large language models (LLMs) is answering questions based on a specific external knowledge corpus. Retrieval-Augmented Generation (RAG) is a popular technique for building question answering systems that use an external knowledge base. To learn more, refer to Build a powerful question answering bot with Amazon […]  ( 7 min )
  • Open

    Ruzsa distance and Ruzsa entropy warmup
    This summer I wrote about Ruzsa distance and Ruzsa entropy. That earlier post would make a good warmup for reading Terence Tao’s latest post On a conjecture of Martin. Ruzsa distance and Ruzsa entropy warmup first appeared on John D. Cook.  ( 4 min )
    How to memorize a Bitcoin address
    The latest episode of Darknet Diaries interviews someone using the pseudonym Default. He says in the interview that he had nearly a thousand Bitcoins (about $36 M) in a wallet stored on an external hard drive that was seized by federal agents when they raided his home. Default went to prison for five years for […] How to memorize a Bitcoin address first appeared on John D. Cook.  ( 7 min )
  • Open

    Future-proofing advanced data warehouses
    Introduction Data warehouses are the linchpins of modern data infrastructure, integral for businesses that rely on informed decision-making. They act as centralized repositories where diverse data from various sources is collated, transformed, and stored for analysis and reporting. In essence, a data warehouse is a structured data haven designed for query and analysis, providing crucial… Read More »Future-proofing advanced data warehouses The post Future-proofing advanced data warehouses appeared first on Data Science Central.  ( 23 min )
  • Open

    Simplifying Transformer Blocks
    Paper: https://arxiv.org/abs/2311.01906 GitHub: https://github.com/bobby-he/simplified_transformers Abstract: A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections & normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable. In this work, we ask to what extent the standard transformer block can be simplified? Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput, and using 15% fewer parameters. https://preview.redd.it/eze7pk3lu00c1.png?width=1129&format=png&auto=webp&s=9a1ecebee5a98ab385e69c619c1f91bed2985236 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    Generative vs. Discriminative AI
    submitted by /u/Neurosymbolic [link] [comments]  ( 9 min )

  • Open

    [D] What do you think about representing each neuron and each synapse as a small neural network that will form a larger neural network?
    For example, as input data, the synapse neural network will receive a forward propagation vector from the pre-neuron and a back propagation vector from the post-neuron. I think a model with a this approach can be taught to form short-term and long-term memory. Maybe we can use the output of the last gpt block to train this model. submitted by /u/Silly_Commercial_514 [link] [comments]  ( 9 min )
    [D] A question about creating an integrated platform to make predictions using big bio-data.
    Hi Folks, As a bench scientist, programming and ML are all Greek to me. But, I frequently use several bioinformatics tools on webpages such as BLAST, Gene ontology, AlphaFold and PTM prediction tools to inquiry any gene of my interest in a research context. I intend to create a software run by AI/ML-assisted text mining/natural language processing (NLP) algorithms (!) to be able to customize and integrate the separate prediction tools in one place. My overall goal is to get more accurate predictions to design my wetlab experiments. It might be either website or installable program app format utilizing the publicly available transcriptome and proteome data to generate such predictions. I’m wondering what kind of skill set this project requires to create such tool embedded in a website or downloadable software application? I’m planning to reach out to a few experts to get their help and guidance, but I don’t know which field guys would be correct people to talk about this project, like AI expert, bioinformaticians, software developer or ML guys. I even don’t know whether such tool needs AI and ML or text mining algorithms, but it seems like AI may help putting all the data together in a format that I can understand. Any advise would be greatly appreciated. Thanks. submitted by /u/VolSapiens [link] [comments]  ( 9 min )
    is it feasible to implement the FAISS library? [D]
    I wanted to use the FAISS library on a RAM-constrained system with vector databases of size around 5 GB (about 20 million vectors where every vector length is 71), but the system has a lot of ROM (more than 500 GB). but the problem with the FAISS library is that it creates everything in the RAM. Is there a solution to my problem or shall I implement the FAISS library for my needs (I shall deal with text files to be my database where every node connection to another is based on id)? submitted by /u/abdosalm [link] [comments]  ( 9 min )
    [D] Run Pytorch model inference on Microcontroller
    I am currently researching ways to export models that I trained with Pytorch on a GPU to a microcontroller for inference. Think CM0 or a simple RISC-V. The ideal workflow would be to export c-sourcecode with as few dependencies as possible, so that it is completely platform agnostic. What I noticed in general is that most edge inference frameworks are based on tensorflow lite. Alternatively there are some closed workflows, like Edge Impulse, but I would prefer locally hosted OSS. Also, there seem to be many abandoned projects. What I found so far: Tensorflow lite based Tensorflow lite TinyEngine from MCUNet. Looks great, targeting ARM CM4. CMSIS-NN. ARM centric. Examples. They also have an example for a pytorch to tflite converter via onnx TinyMaix. Very minimalistic, can also be used on RISC-V nnom Pytorch based PyTorch Edge / Executorch Sounds like this could be a response to tflite, but it seems to target intermediate systems. Runtime is 50kb... microTVM. Targeting CM4, but claims to be platform agnostic. ONNX DeepC. Open source version of DeepSea. Very little activity, looks abandoned onnx2c - onnx to c sourcecode converter. Looks interesting, but also not very active. cONNXr - framework with C99 inference engine. Also interesting and not very active. Are there any recommendations out of those for my use case? Or anything I have missed? It feels like there no obvious choice for what I am trying to do. Most solutions that seem to hit the mark look rather abandoned. Is that because I should try a different approach or is the field of ultra-tiny-ml OSS in general not so active? submitted by /u/cpldcpu [link] [comments]  ( 9 min )
    [D] Deep Learning with GTX 1660 Super
    Did anyone use Tensorflow / Pytorch on GTX 1660 Super? Do these libraries work on this GPU perfectly? submitted by /u/Abu_BakarSiddik [link] [comments]  ( 9 min )
    [P] Python library for models or methods from research papers
    I have been trying to apply methods / models from research papers lately. After some thoughts, I think it can be shared as python package. Please give it a try if it suits your needs. Github submitted by /u/Ellzaf [link] [comments]  ( 9 min )
    "[Project]" Using AI to enhance drawing
    Hi, I am an industrial design student and as part of my project, I need to make an AI that enhances drawing. It is not expected of me to make something from scratch I can use already existing models. However, I need to add something to it of course. I don’t have sufficient knowledge in programming to get started and I could use any advice. I want to spend time on this but i would go with a simple option as I don’t have time to learn everything. I would be super happy if someone could help submitted by /u/burqeon [link] [comments]  ( 9 min )
    [R] UNINEXT : Universal Instance Perception as Object Discovery and Retrieval
    submitted by /u/iFighting [link] [comments]  ( 9 min )
    [D] Resources for learning
    I have just started with ML. I know math, python, numpy, matplotlib and pandas. I am thinking to buy Andrew Ng Advance deep learning course. But before that is there any resources for basic of ML/deep learning. Any other suggestion is also welcomed :) submitted by /u/RemarkableEnd123 [link] [comments]  ( 9 min )
    [Discussion] Automating Flyer Classification and Image Cropping: Insights Needed
    Hey there, Reddit! I'm part of a team at an app company that showcases the latest offers from supermarkets and hypermarkets. Our current workflow involves manually sorting lots of flyers into 14 categories and individually cropping images from these flyers which a team of data entry operators are currently doing. It's as time-intensive as it sounds. I'm curious about leveraging machine learning to streamline this. My research suggests that ResNet50 could be a strong contender for the image classification piece. Has anyone here applied it in a similar context, or could you recommend something more apt or better for this task? Regarding image cropping, is there an ML approach that could automate this step, recognizing and cutting out the pertinent sections from a variety of flyer layouts? I'm a software engineer and as such have limited knowledge of working with ml, however, my current plan is to develop it with the help of chatgpt. In the past, I had built a simple stock price prediction system using lstm. Any shared experiences, alternative suggestions, or resources would be immensely valuable. Looking forward to your input! submitted by /u/dragonwarrior_1 [link] [comments]  ( 9 min )
    [D] Apple Machine Learning Ad
    Assuming this isn’t some completely fabricated thing for visual purposes, would anyone know how to make the black hole visual depicted in this Apple ad… or similar? https://www.reddit.com/u/apple/s/ZhwLOo0Y4p submitted by /u/Prize_Duty6281 [link] [comments]  ( 9 min )
    [D] intermediate attention values of llama
    Is anyone aware of how to obtain attention values of LLaMA model? For example, if I want to obtain attention values (of size 4096) from layer 24. How do I get them? submitted by /u/1azytux [link] [comments]  ( 9 min )
    [P] End-to-end Keyword Bidding for Apple Search Ads
    I've noticed a gulf in production system machine learning knowledge sharing on this sub-reddit. Here's my contribution on some work I did with a colleague along this direction. Happy to answer questions! https://www.canva.dev/blog/engineering/end-to-end-asa-keyword-bidding/ submitted by /u/ptuls [link] [comments]  ( 9 min )
    [D] Sport game prediction
    I have quite a large dataset of historical games for a sport. Generally speaking what is the best way to predict the winner of these games? Currently I have a program transforming every game into a bunch of features (participant ages on the day, their wins, stats at the time, etc) and this outputs a binary value whether team 0 or team 1 wins. I guess my questions are: 1) Generally speaking when training a complex model for something like game predictions where its hard to determine whether a parameter is particularly useful or not, is it better to just have as many parameters as possible? Or is it possible that too many can be detrimental. For example, I could have a single parameter for “career minutes played”. Or would it be more effective to have the career minutes played and also career minutes played for every quarter because players could have varying experience in certain times of the game 2) What kind of model architecture is generally perceived as the best for something like this where we have 100s of input parameters all boiling down to probabilities for the outcome being 0 or 1? Currently I am trying to use both random forest classification and feed forward neural nets. If neural networks are the avenue I should pursue, is it generally agreed upon that bigger is better for FNNs? More hidden layers? Larger hidden layers? submitted by /u/exater [link] [comments]  ( 9 min )
    [D] Inconsistent GPU Performance
    Hi everyone, I have a question about GPU performance that I'm measuring using CUDA events. I'm running an LLM model in PyTorch on an A100 GPU. The initial performance report appears inconsistent and noticeably higher than the results from the second run onwards. Do any of you have insights into why this discrepancy might be occurring? Could there be any caching mechanisms influencing the second run's results? I would greatly appreciate any hints or suggestions on this matter. Thank you! submitted by /u/ramya_1995 [link] [comments]  ( 9 min )
    [D] Multiplication Estimator Model
    So, recently i came accross a tiny transformer model that was able to figure out how to add x&y, by tuning it's parameters to matchup angle addition // trigonometery. I was excited and built a simple FF NN that can multiply X and y, with the following inputs: x_frac, x value reduced to fraction by logarithmic scale (10) logarithmic scale of x, divided by 10 y_frac, y value reduced to fraction by logarithmic scale (10) logarithmic scale of y, divided by 10 the output was: frac logarithmic scale (10) Traning ran for 30,000 epochs with 1024 batch at a time, it took 20 seconds to complete training. At the end of the training the loss reached 3....e-12 I tested with 10,000 random pairs, the accuracy reached 99.993% The interesting thing is the accuracy of the calculation always captured the log scale accurately, it only struggled at insignificant value, like how humans say, it's something like 106,3.. , I have to make it clear, it didn't memorize it, it generalized appropriately. submitted by /u/ramimmo [link] [comments]  ( 9 min )
  • Open

    Create Your Custom GPT (Step by Step Guide)
    submitted by /u/Senior_tasteey [link] [comments]  ( 9 min )
    Exclusive: Google in talks to invest in AI startup Character.AI
    submitted by /u/donutloop [link] [comments]  ( 9 min )
    Nvidia CEO says his AI powerhouse is ‘always in peril’ | Fortune
    . submitted by /u/AminoOxi [link] [comments]  ( 9 min )
    AI generator or chatbot
    AI Story Generator free to inexpensive Ai story writer Not sure if a similar question has been asked, but I have been testing out a few different AI story generator/chat bots and I am having trouble finding one that fit my needs. First need, I write psychological thrillers and Dark Romances. The ones I have found do not write violence or sexual scenes. Most of my stories are about murder, carnage and finding a semblance of twisted love along the way. Second need most I have found, even for there "novel chatbot/generator" top out at about 20k words. My novels are between 80k and 100k So I guess what I am asking is does anyone know of an AI writer or chatbot that will write for all genres and scenes that top out at more then 100k words? bonus if it has a structure to assist with multi book series submitted by /u/urLordV [link] [comments]  ( 9 min )
    Existential low-fps journey through meat production using Adobe Firefly 2 (Beta)
    submitted by /u/BigMorris [link] [comments]  ( 9 min )
    I need ai help
    Hey, is have an image with some friend that I want to put as a phone background, but the picture is too square, and to fill the phones screen it has some black space at the top and the bottom, anyone has a photoshop account that could help me use the ai to fill those spaces please submitted by /u/Boterito777 [link] [comments]  ( 9 min )
  • Open

    Problems with simple number forecast problem
    Hi, I want to get some knowledge in reinforcement learning and after reading the standard books try some practical things. Something which should be really easy to implement is a matching algorithm, where the action should match the numerical value of the last observation. However I have some troubles with implementing it with RLLib, the values seem to be rather random. Does anyone have an idea what is wrong here: https://github.com/RandomGitUser123/RL/blob/main/RLLibSimpleTest.ipynb submitted by /u/WeirdWirdo [link] [comments]  ( 9 min )
    Internship for Spring 2024
    Hello, I am a Ph.D. candidate in Mechanical Engineering with a focus on Robotics & Controls, and I expect to graduate in August 2024. During my graduate studies, I have been privileged to work on research projects ranging from satellite formation flight, multi-robot control, deep reinforcement learning to engineering education. I have designed and implemented control algorithms for various mobile robots such as cars, drones, mobile manipulators, and satellites. I am looking for research-based internship in robotic reinforcement learning for this Spring, any suggestions? submitted by /u/anointedninja [link] [comments]  ( 9 min )
    Any uses cases you’ve seen in physical retail?
    Hey, just curious to know if you have come across any use cases for reinforcement learning in physical retail ( not e-commerce)? Facings recommendations for products I was thinking? Thanks submitted by /u/Living_Teaching9410 [link] [comments]
  • Open

    Vanderbilt Machine Learning Seminar Talk “Conformal Prediction under Ambiguous Ground Truth”
    Last week, I presented our work on Monte Carlo conformal prediction — conformal prediction with ambiguous and uncertain ground truth — at the Vanderbilt Machine Learning Seminar Series. In this work, we show how to adapt standard conformal prediction if there are no unique ground truth labels available due to disagreement among experts during annotation. In this article, I want to share the slides of my talk. The post Vanderbilt Machine Learning Seminar Talk “Conformal Prediction under Ambiguous Ground Truth” appeared first on David Stutz.  ( 4 min )
  • Open

    Comgra: A library for debugging and understanding neural networks
    The newest version of Comgra is out! Thanks to everyone for all your feedback! Have you ever had difficulties debugging a neural network? Comgra (computation graph analysis) is a library for pytorch to visualize and explore your computation graph in detail. The GUI allows you to look at your data from many different angles, to view examples, check summary statistics, investigate gradients, trace bugs through the network step-by-step, and more. Use Tensorboard to get an overview of your model, and use Comgra for a deep-dive. My goal is to make this the go-to library used both by novices who want to understand what's going on under the hood, and by researchers in neural architecture design who need to know all the details. submitted by /u/Smart-Emu5581 [link] [comments]  ( 9 min )
  • Open

    US Census area hierarchy
    Some kinds US Census geographic areas nest into a tidy hierarchy, but others do not. Here’s a brief overview of both. Hierarchical entities The orderly hierarchy is nation region division state county census tract block group census block. All cleanly nested. There are four regions: West, Midwest, Northeast, and South. Each region splits into two […] US Census area hierarchy first appeared on John D. Cook.  ( 5 min )

  • Open

    Reward function Design
    Hello everyone, I am trying to create a suitable reward function for an agent to incentivize it to jump over a random shaped obstacle. I am curious to know if anyone has tried or come across a reward function designed for this specific task of jumping over an obstacle. Right now I reward the agent if its torso position is over the obstacle height and if the position is beyond the obstacle. Any help is much appreciated submitted by /u/blitzkreig3 [link] [comments]
    Training/Testing
    Hello everyone, I`m working in DRL applied in production scheduling. And I have a question about it: the DQN agent is trained on a specific configuration (number of machinea = 15, number of jobs = 25, ....) and tested on other configurations (number of machines =20, 15, ....) ? or there is another way to train the DQN ? submitted by /u/GuavaAgreeable208 [link] [comments]
    How can the softmax distribution be used to detect out-of-distribution samples?
    I am reading this paper and it states that - "In what follows we retrieve the maximum/predicted class probability from a softmax distribution and thereby detect whether an example is erroneously classified or out-of-distribution." ​ However, I don't see how they use the softmax distribution to detect OOD samples. In their Table 2, they have the following line: "Table 2: Distinguishing in- and out-of-distribution test set data for image classification. CIFAR10/All is the same as CIFAR-10/(SUN, Gaussian)." ​ My question is how do they distinguish between in and out-of-distribution samples? submitted by /u/Academic-Rent7800 [link] [comments]
    "Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes", Kumar et al 2022
    submitted by /u/gwern [link] [comments]
  • Open

    [R] MADLAD-400 - 4.6 / 2.6 trillion token dataset covering 419 languages + translation models up to 10.7B parameters
    Dataset: https://huggingface.co/datasets/allenai/MADLAD-400 "Note that the english subset in this version is missing 18% of documents that were included in the published analysis of the dataset. These documents will be incoporated in an update coming soon." arXiv paper: https://arxiv.org/abs/2309.04662 Models: https://github.com/google-research/google-research/tree/master/madlad_400 u/jbochi's work on getting the models to run: https://www.reddit.com/r/LocalLLaMA/comments/17qt6m4/translate_to_and_from_400_languages_locally_with/ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [P] GPT-4 vision utilities to enable web browsing
    Wanted to share our work on Tarsier here, an open source utility library that enables LLMs like GPT-4 and GPT-4 Vision to browse the web. The library helps answer the following questions: How do you map LLM responses back into web elements? How can you mark up a page for an LLM to better understand its action space? How do you feed a "screenshot" to a text-only LLM? We do this by tagging "interactable" elements on the page with an ID, enabling the LLM to connect actions to an ID which we can then translate back into web elements. We also use OCR to translate a page screenshot to a spatially encoded text string such that even a text only LLM can understand how to navigate the page. View a demo and read more on GitHub: https://github.com/reworkd/tarsier submitted by /u/asim-shrestha [link] [comments]  ( 9 min )
    [Project] Big 5 Personality Project Question
    I'm looking for some advice regarding a project idea I have. I would like to predict the big five personality traits for authors based on an analysis of their writing samples. However, would I need to have had some authors take the big five personality assessment and have a training set with those results in order to do a project like this? Or is their a way to "guess" what certain writing patterns would correlate with? What would be the potential strategy for orienting an ml project like this? submitted by /u/OpenJuggernaut8556 [link] [comments]  ( 9 min )
    [R] Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
    Hey everyone! 👋 Please check out this paper, titled "Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation." This work presents a novel approach to make text-to-audio generation much faster while retaining high quality. Here's a quick breakdown: Problem: Existing text-to-audio generation based on diffusion-models are slow 🐢. They require multiple (up to hundreds) queries to the denoising network, making them less practical for time-sensitive or computationally constrained applications. Solution: This work introduces a modification to the "consistency distillation" framework. This tweak allows text-to-audio models to operate with just a single neural network query. This means generation speeds get a massive boost - we're talking hundreds of times faster! ⚡ Quality: Despite the speed, the quality doesn't take a hit. By integrating classifier-free guidance into the distillation framework, these models maintain the impressive generation quality and diversity that diffusion models are known for. 🎶 End-to-End Fine-Tuning: The resulting non-recurrent differentiable structure from the distillation opens doors to novel loss functions. The paper uses CLAP loss as a case study, showing that end-to-end fine-tuning can further enhance the generation quality. 📈 Pre-print Paper: https://arxiv.org/abs/2309.10740 \ Project Homepage: https://consistency-tta.github.io \ Demo Examples: https://consistency-tta.github.io/demo submitted by /u/No-Classic7390 [link] [comments]  ( 9 min )
    Training/Testing [P]
    submitted by /u/GuavaAgreeable208 [link] [comments]  ( 9 min )
    [D] In conditional GAN (cGAN) architecture, why does the discriminator need a conditional variable ?
    I'm reading about conditional GAN (cGAN) architecture, what I know is that the generator creates images combining both noise vector and conditional variable, the noise vector brings in random elements like colors or shapes while conditional variable is used for maintaining the same object. As for the discriminator, the input is an image that is either fake (generated by the generator) or real (from the dataset) combines with the conditional variable. What I don't understand is that why do we also include the conditional variable in the discriminator, I get that the generator needs them for guidance, but why does the discriminator, which is just classifying fake or real, require this additional information? Can someone help me understand ? Thanks for your help submitted by /u/abcdlmao [link] [comments]
    [D] Choosing Between Two MATH Courses for Deep Learning Specialization
    Next semester, my university is offering two math courses: one named 'Linear & Non-Linear Programming' and the other named 'Applied Matrix Theory'." Description of the first course: "The course covers the formulation of linear programs, basic properties of linear programs, and the simplex method. It includes duality, necessary and sufficient conditions for unconstrained problems, minimization of convex functions, methods for solving unconstrained problems, equality and inequality constrained optimization, the Lagrange multipliers theorem, the Kuhn-Tucker conditions, and methods for solving constrained problems." Description of the second course: "This course includes a review of the theory of linear systems, eigenvalues and eigenvectors, the Jordan canonical form, bilinear and quadratic forms, matrix analysis of differential equations, and variational principles and perturbation theory. Topics include the Courant minimax theorem, Weyl's inequalities, Gershgorin's theorem, perturbations of the spectrum, vector norms and related matrix norms, and the condition number of a matrix." ​ I want to specialize in deep learning. Could anyone advise which course would be more beneficial for this field? submitted by /u/ABDULMALK-ALDAYEL [link] [comments]  ( 9 min )
    Weekly AI News Aggregate[R]
    submitted by /u/pinnapple-crush [link] [comments]  ( 9 min )
    [R] Building a plant identification app via Convolutional Neural Networks (CNNs)
    Building a plant identification app As an amateur botanist and programmer, it's a dream for me to build an app that identifies a plant based on the image using machine learning, and I'm so excited to say this may become a reality. Current Concept *The app will primarily be coded in Python and a bit of Java I will gather and assemble 10,000 images of different plants (I have the necessary storage needed for this) After gathering the images, I will individually label the images with their appropriate species names I will convert the images into datasets the model can properly understand Then, I will vigourously train the model on this humoungous dataset and will use a method of pattern recognition called Convolutional Neural Networks (CNNs) CNNs are the perfect type of architecture needed for accomplishing this mighty task, as not only they perform superiorly but work on grid-like data like images, and have computer vision Based on CNNs, I will train the model to recognize the image, observe it's morphological characteristics (leaf patterns, flowers, etc) and retrieve a relevant result based on the datasets it has been trained on Finally, the model will generate a relevant species name / hybrid / cultivar How it works The user will input an image into the app The model will see the image, observe it's morphology and retrieve a final, concise result based on the datasets it's been trained on Due to technical limitations, the model can ONLY generate a result for one species in an image, not multiple. The user will have to crop the image to figure out the other species if necessary. Side notes This is a very mere concept and this goes FAR complexer then this, I know! But using these ideas, I am confident I can develop something. Even If I fail, this will atleast give me valuable skills in which I can use in future endeavours. Any help/suggestions/comments greatly welcome! Thanks :) submitted by /u/cystidia [link] [comments]  ( 10 min )
    [R] Trends in Machine Learning Hardware - Epoch 2023
    Report: https://epochai.org/blog/trends-in-machine-learning-hardware Dataset: https://docs.google.com/spreadsheets/d/1NoUOfzmnepzuysr9FFVfF7dp-67OcnUzJO-LxqIPwD0/edit?usp=sharing Abstract: We analyze recent trends in machine learning hardware performance, focusing on metrics such as computational performance, memory, interconnect bandwidth, price-performance, and energy efficiency across different GPUs and accelerators. The analysis aims to provide a holistic view of ML hardware capability and bottlenecks. https://preview.redd.it/x47il6669rzb1.png?width=1610&format=png&auto=webp&s=b0bbd98602aa983501c1d675e1aeac093f06fb95 ​ submitted by /u/APaperADay [link] [comments]
    [Project] Doubt on a ML project
    Good morning everyone, I'm working on a machine learning project and I'm facing a doubt. I'm working with RapidMiner on a bank churn dataset and I'm testing different algorithms to see which one is the most accurate in predicting a customer's churn. I'm testing: Decision tree, Gradient Boosted, KnaiveBayes, Neaural Net, Logistic Regression and Random forest. The strange thing, and this is my doubt, is that normalizing the attributes causes me to increase performance also with regard to decided trees, gradient boosted and random forest. Which is strange precisely because this should not affect the performance of this type of algorithm. Can anyone help me by giving me a possible explanation for this? I thank in advance whoever will help me Have a great day, Mattia submitted by /u/Mattoxxx99 [link] [comments]
    [D] Is obtaining a PhD title worth it?
    tl;dr is it worth to obtain PhD title in order to get better DS career perspectives, even if it would require a lot of commitment and dedication? I'm 25 yo, working full-time as Backend Engineer and also studying CS (specialization in DS) full-time. It's my final year and I'm already working on my master's thesis. So here's the thing. After solid 5 years spend in IT industry as backend developer I come to conclusion that's not really what I want to do for the rest of my life. I worked at couple different companies (both startups and Big4), and I don't see any development prospects for myself in this branch of the industry. It no longer gives me that "sparkle" and feel of fulfillment as it used to. At the same time, during my CS studies at Uni I started working on some basic DS stuff, I w…  ( 10 min )
    [D] Database Choice for Recommenders
    I am building a recommender using Tensorflow, the point is to learn new technologies and experiment with different ideas. While reading a bit about how to approach my project I remember people mentioning that graph databases would work well for machine learning. I'm just wondering what is the usual approach for big systems like the ones used at Netflix, YouTube, Tinder, and other big platforms with recommenders? I know that graph databases work well for social apps since they handle relationships really well, but where do they fit in the context of machine learning? Where are they queried? Is it when making recommendations to users or during model training? Or maybe both? submitted by /u/iTsObserv [link] [comments]  ( 9 min )
    [Discussion] ICLR 2024 submission statistics
    https://guoqiangwei.xyz/iclr2024_stats/iclr2024_submissions.html submitted by /u/weiguoqiang [link] [comments]  ( 9 min )
    [D] Seeking Your Feedback, your insights mean the world to us (links in comments )
    submitted by /u/EntryElectronic [link] [comments]
    [Project] Automated Feature Engineering , over 100x faster than tsfresh or featuretools
    we are developing an open-source tool for automated feature engineering on relational data and time series. https://github.com/getml/getml-community It is similar to featuretools and tsfresh, but unlike these tools it doesn't build on top of pandas or dash. Instead, it has its own customized database engine. Because of that, it is over 100x faster than the aforementioned libraries while maintaining comparable or better predictive performance, depending on the problem at hand. Check it out if you are interested. Any kind of feedback, particularly constructive criticism is very appreciated. submitted by /u/liuzicheng1987 [link] [comments]  ( 9 min )
    [P] Comgra: A library for debugging and understanding neural networks
    The newest version of Comgra is out! Thanks to everyone for all your feedback! Comgra (computation graph analysis) is a library for pytorch to visualize and explore your computation graph in detail. The GUI allows you to look at your data from many different angles, to view examples, check summary statistics, investigate gradients, trace bugs through the network step-by-step, and more. Use Tensorboard to get an overview of your model, and use Comgra for a deep-dive. My goal is to make this the go-to library used both by novices who want to understand what's going on under the hood, and by researchers in neural architecture design who need to know all the details. submitted by /u/Smart-Emu5581 [link] [comments]  ( 9 min )
    [R] S-LoRA: Serving Thousands of Concurrent LoRA Adapters - by UC Berkeley, Stanford
    submitted by /u/radi-cho [link] [comments]  ( 9 min )
    [N] [P] Google Deepmind released an album with "visualizations of AI" to combat stereotypical depictions of glowing brains, blue screens, etc.
    submitted by /u/radi-cho [link] [comments]  ( 9 min )
    [R] Discover LLaVA-Plus: The Next Leap in Multimodal AI Tool Use!
    🚀 Greetings, fellow Redditors! I'm thrilled to introduce LLaVA-Plus, a remarkable enhancement in the world of multimodal AI. 🤖 This improved iteration of LLaVA ingeniously merges an extensive skill repository with user input, making it a powerful tool for real-world applications. 🔍 Why is LLaVA-Plus so exceptional? It represents a significant evolution, extending beyond mere upgrades. Its exceptional skills in visual comprehension, creation, editing, and external knowledge integration position it as a pioneer in AI technology. LLaVA-Plus has notably excelled, surpassing its predecessor and demonstrating its prowess, especially in the VisITBench. Moreover, LLaVA-Plus is opening new avenues, particularly in multimodal social media communication, showcasing the potential of AI-assisted interactions. 🔗 Want to dive deeper? Explore the project, read the paper, or check out the code using the links below: Project Overview: https://llava-vl.github.io/llava-plus/ Paper: https://arxiv.org/abs/2311.05437 Code: https://github.com/LLaVA-VL/LLaVA-Plus-Codebase Live Demo: https://llavaplus.ngrok.io Join the conversation and share your thoughts on how LLaVA-Plus is shaping the future of AI tool use! #LLaVAPlus #MultimodalAI #AIAssistant #FutureOfAI submitted by /u/one_dragon86 [link] [comments]  ( 9 min )
    [D] Kubernetes - Yay or Nay?
    How often do you guys actually need Kubernetes for Machine Learning? I ask this because I see it as a required skillset in more and more job openings. I have worked only in early stage start-ups. (Total employee count 10-15) For everything I have seen getting dedicated servers or instances would be better for training. Typically we would be working on one/two important models and self managing them was simpler. Even for production, most of model deployment were pre-determined batch jobs. In rare cases it was served live to the users, the response time was in seconds (much higher than CRUD responses) and requests loads were obviously lower (because everything on website is a CRUD request, but only we features require ML). Managing them as a queue was good enough for use cases I saw. So I am genuinely curious, how do you decide that you really need Kubernetes when doing ML? Or do you think it is somewhat of a fad because ML Ops is all the rage right now and Kubernetes is super-star in Dev-ops space from where ML-Ops takes its inspiration? submitted by /u/forever__curious [link] [comments]  ( 9 min )
    [D] Machine Learning PhD failure? Navigating the harsh reality of graduating without publications
    Hi ML community, I'm nearing the end of my three-year PhD journey. Throughout this period, I've dedicated myself to producing a research paper annually, targeting top-tier conferences like ICML, ICLR, and NeurIPS. Despite my efforts and resubmissions, none of my papers made it through. As a result, my publication record consists solely of three manuscripts on arXiv. My initial post-PhD ambition was to delve deeper into machine learning research at leading tech companies such as Facebook, Google, or Microsoft. However, my applications were turned down, primarily due to the lack of publications in prestigious conferences, which seems to be a crucial criterion for these roles. Confronted with this setback and the pressing need to manage my finances, I shifted my focus to more traditional industry roles in consulting and finance. I've recently secured a position in quant finance, which, while exciting, means I won't have the bandwidth to revisit and resubmit my research papers. Reflecting on this journey, I sometimes feel disheartened, questioning the value of my PhD experience, especially when I consider my lack of published work in major machine learning conferences. I see other PhD students in my field publish 2 papers per year in these top conferences which makes me wonder whether I am a failure? I'm open to any thoughts or advice on my situation. submitted by /u/Training_Bet_7905 [link] [comments]
  • Open

    just a hobbyist making GPTs, and quite honestly, it's lovely
    I'm thoroughly enjoying my journey into creating GPTs through conversations with an LLM. I'm just a hobbyist, deeply intrigued by this space since the December 2022 singularity. I thought it'd be interesting to spark a conversation here. There's something uniquely captivating about the process of discussing with a sophisticated LLM to refine and enhance a bot or system prompt. While I know this could have been achieved previously with system prompts, Python scripts, and API calls, the direct dialogue with an advanced LLM, and watching it skillfully tweak the underlying JSON or variables, is fascinating. Does anyone else share this excitement? submitted by /u/muldoon_vs_raptor [link] [comments]
    Here's a HackerNews style newsboard for AI and robotics
    I built this website called gptroad to aggregate AI and robotics updates in a simple UI/UX, similar to HackerNews. I'd appreciate any feedback. submitted by /u/Remarkable_Ad9528 [link] [comments]
    Biden, Xi to pledge ban on AI in autonomous weapons in drones, nuclear warhead
    US President Joe Biden and Chinese President Xi Jinping are expected to pledge a ban on the use of artificial intelligence (AI) in autonomous weapons such as drones and nuclear warhead control. The potential dangers of AI will be a major focus of their meeting on the margins of the Apec summit in San Francisco. Both countries have expressed concerns over the unregulated use of AI technology in fueling conflicts. So far, 36 countries have backed the initiative, pledging to come together next year to explore ways to implement and improve new regulations on the matter. In October, the Biden administration also announced requirements for the approval of advanced AI products. Under the new rules, such initiatives must receive federal government certification, ensuring they cannot be repurposed for creating biological or nuclear weapons. Source : https://www.scmp.com/news/china/military/article/3241177/biden-xi-set-pledge-ban-ai-autonomous-weapons-drones-nuclear-warhead-control-sources submitted by /u/NuseAI [link] [comments]
    The promise of Collective Superintelligence
    submitted by /u/cranberryfix [link] [comments]
    funny adventure time legends of Zelda crossover
    What did Bing/dall-e meant by this panel, "'thin'"? lol submitted by /u/Mardicus [link] [comments]
    "Digital Dilemmas: Navigating the Darker Realms of AI Ethics"
    submitted by /u/Fit-Code-5141 [link] [comments]
    Techno Medusa
    This character, embodying a cyberpunk aesthetic, is a permutation of Medusa from Greek mythology. I find Medusa fascinating for various reasons. Perhaps it's her immense power coupled with her significant flaw that captivates me. The dichotomy is, to say the least, intriguing. submitted by /u/Exitium_Maximus [link] [comments]
    One-Minute Daily AI News 11/10/2023
    Google is in talks to invest hundreds of millions of dollars in Character.AI, as the fast growing artificial intelligence chatbot startup seeks capital to train models and keep up with user demand.[1] Microsoft stock on Friday set a new all-time high as the company prepares to announce its latest advancements in artificial intelligence at a conference next week.[2] A third of companies have already laid off workers due to AI.[3] AI that reads brain scans shows promise for finding Alzheimer’s genes.[4] Sources: [1] https://www.reuters.com/technology/google-talks-invest-ai-startup-characterai-sources-2023-11-10/ [2] https://www.investors.com/news/technology/microsoft-stock-notches-record-high-ahead-of-ai-news/ [3] https://www.accountingtoday.com/news/survey-finds-37-of-finance-leaders-have-already-laid-off-workers-due-to-ai [4] https://www.nature.com/articles/d41586-023-03482-9 submitted by /u/Excellent-Target-847 [link] [comments]
    Universal AI translation layer on top of OS will make OS invincible and perhaps kill Microsoft / Apple?
    So Bill posted to /r/artificialintelligence his bit abount ai agents in windows. But I think we should get a layer instead. A sort of translation layer that translates how to use OS to the AI / User. This layer should be smart, every program that is installed on a computer should give the layer custom instructions how to use it. So this layer should understand the underlying OS and work as a gateway between the user / user AI. Perhaps it can even be universal so it can work on top of OS X, Windows, Linux etc. Once we have that, the AI can talk to the OS directly and ask it to do things and the layer will translate requests into actions. Then, when we walk around, perhaps the AI - our personal assistant can talk invincibly to devices around us and we will not even know. It will talk directly to this translation layer and the translation layer will use the bank app / the metro transit app / the communicator app / the health care app and so forth. If this happens. We may no longer really need OS brand X as it makes sense to just use what works. Which means that Windows and Apple may be on their way out and cheapest system which is Linux.. Will replace them. submitted by /u/aluode [link] [comments]
    Welcome to Tech Subsequent
    Prepare to be awestruck as we embark on an electrifying journey into the captivating realm of robotics. Brace yourself for an exclusive revelation of the tightly guarded secrets behind Ameca, the unrivaled titan of the robotic world https://www.youtube.com/watch?v=JXMO_ome71g submitted by /u/BackgrondGas [link] [comments]
    mini-choice of AI services for interior design
    Good afternoon, dear friends! By popular demand made a mini selection of AI-services for interior design (for those who are more convenient to read on the sites,Not all of them are free, but in each there is an opportunity to evaluate the functionality. Spacely (https://www.spacely.ai/). The cost is 15$ per month, but there is an opportunity to test it for free. REimagine Home (https://www.reimaginehome.ai/). Powerful tool for generating interiors and you can use it for free, payment only for downloading images in high resolution (I will not say that cheap, but if the images will be used for commercial purposes, then 1$ per image - in principle not so much). That said, the images can be used for commercial purposes. Re-imagine Home AI allows users to write promts by working with a singl…
    Introducing The Shaman: A GPT Who Can Act As Your Spiritual Guide Through Psychedelic Journeys 🍄
    submitted by /u/dangerpotter [link] [comments]
  • Open

    Continual Learning of a Contrastive Autoencoder
    I am having an issue training data continuously on a contrastive autoencoder and I was wondering if anyone could help me. I am training an autoencoder to separate safe and unsafe data using contrastive loss. In this task I continuously collect safe and unsafe data, when I reach 99 data point of each, I generate pairs and train the autoencoder on similar and dissimilar data. As you can see at Episode 131 (the first image) the data is separated well, and the autoencoder can clearly pair similar data together and separate dissimilar data. However, as training continues, my autoencoder seems to perform worse. Does anyone know why? The things I have tried are: making my model more complex and more simple, adding more regularization in the form of L2 and dropout, tweaking the learning rate so…
    Twin Delayed Deep Deterministic Policy Gradient (TD3) in JavaScript, TensorFlow.js
    Hi everyone I've been using ChatGPT(3.5) to help me convert Python code using TD3 into JavaScript with TensorFlow JS. This is for the community and not for personal gain. I've yet to find an example of this, hence the conversion. My goal is to make a basic blueprint for the community to use on TensorFlow JS projects. When complete, the agent will be displayed on an HTML5 canvas walking toward a civilian for good reward, while avoiding a zombie (negative penalty). Eventually I will develop a separate, far more advanced simulation based on this. The bad news: ChatGPT is shakey when it comes to complex conversions, and my Python/Tensorflow knowledge base isn't perfect. At the moment the agent isn't learning yet, but it's running without errors. I expect the code has mistakes I don't even know about yet. The good news: I have made a lot of progress and have a GitHub repository set up for the community to learn from and use the project: https://github.com/CloudZero2049/TD3-TensorFlowJS I would love for anyone who knows the intricacies of TD3 (DDPG is a close relative), and TensorFlow JS to help me get this blueprint project setup for everyone. I've hit a roadblock and can't figure out why the agent isn't learning / keeps selecting the same bad action (even while penalized). The README on GitHub has more info and resources. submitted by /u/CloudZero2049 [link] [comments]
  • Open

    The Danger of Untethered Use Cases
    Client: “We’ve identified over 60 use cases!” Schmarzo: “You might as well have zero…” I always get very concerned when I see organizations that jump right into the use case inventory process without first considering the organization’s key business initiatives. Every organization has a laundry list of use cases.  They tend to be the dumping… Read More »The Danger of Untethered Use Cases The post The Danger of Untethered Use Cases appeared first on Data Science Central.  ( 22 min )
  • Open

    Learn your fruits and vegetables
    Thanks to DALL-E3 generated educational material, we can bypass the need for teachers and textbook writers. Can it do fruits? Of course it does fruits! Or perhaps you would like to learn your berries? Perhaps you would like to learn your berries in SWEDISH? I'm learning so much.  ( 2 min )
    Bonus: learning about fruits and vegetables!
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Bayesian Methods for Media Mix Modelling with shape and funnel effects. (arXiv:2311.05587v1 [cs.LG])
    In recent years, significant progress in generative AI has highlighted the important role of physics-inspired models that utilize advanced mathematical concepts based on fundamental physics principles to enhance artificial intelligence capabilities. Among these models, those based on diffusion equations have greatly improved image quality. This study aims to explore the potential uses of Maxwell-Boltzmann equation, which forms the basis of the kinetic theory of gases, and the Michaelis-Menten model in Marketing Mix Modelling (MMM) applications. We propose incorporating these equations into Hierarchical Bayesian models to analyse consumer behaviour in the context of advertising. These equation sets excel in accurately describing the random dynamics in complex systems like social interactions and consumer-advertising interactions.  ( 2 min )
    Using Early Readouts to Mediate Featural Bias in Distillation. (arXiv:2310.18590v2 [cs.LG] UPDATED)
    Deep networks tend to learn spurious feature-label correlations in real-world supervised learning tasks. This vulnerability is aggravated in distillation, where a student model may have lesser representational capacity than the corresponding teacher model. Often, knowledge of specific spurious correlations is used to reweight instances & rebalance the learning process. We propose a novel early readout mechanism whereby we attempt to predict the label using representations from earlier network layers. We show that these early readouts automatically identify problem instances or groups in the form of confident, incorrect predictions. Leveraging these signals to modulate the distillation loss on an instance level allows us to substantially improve not only group fairness measures across benchmark datasets, but also overall accuracy of the student model. We also provide secondary analyses that bring insight into the role of feature learning in supervision and distillation.  ( 2 min )
    Penalising the biases in norm regularisation enforces sparsity. (arXiv:2303.01353v3 [stat.ML] UPDATED)
    Controlling the parameters' norm often yields good generalisation when training neural networks. Beyond simple intuitions, the relation between regularising parameters' norm and obtained estimators remains theoretically misunderstood. For one hidden ReLU layer networks with unidimensional data, this work shows the parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor. Notably, this weighting factor disappears when the norm of bias terms is not regularised. The presence of this additional weighting factor is of utmost significance as it is shown to enforce the uniqueness and sparsity (in the number of kinks) of the minimal norm interpolator. Conversely, omitting the bias' norm allows for non-sparse solutions. Penalising the bias terms in the regularisation, either explicitly or implicitly, thus leads to sparse estimators.  ( 2 min )
    Exploring the Distributed Knowledge Congruence in Proxy-data-free Federated Distillation. (arXiv:2204.07028v5 [cs.LG] UPDATED)
    Federated learning (FL) is a privacy-preserving machine learning paradigm in which the server periodically aggregates local model parameters from clients without assembling their private data. Constrained communication and personalization requirements pose severe challenges to FL. Federated distillation (FD) is proposed to simultaneously address the above two problems, which exchanges knowledge between the server and clients, supporting heterogeneous local models while significantly reducing communication overhead. However, most existing FD methods require a proxy dataset, which is often unavailable in reality. A few recent proxy-data-free FD approaches can eliminate the need for additional public data, but suffer from remarkable discrepancy among local knowledge due to client-side model heterogeneity, leading to ambiguous representation on the server and inevitable accuracy degradation. To tackle this issue, we propose a proxy-data-free FD algorithm based on distributed knowledge congruence (FedDKC). FedDKC leverages well-designed refinement strategies to narrow local knowledge differences into an acceptable upper bound, so as to mitigate the negative effects of knowledge incongruence. Specifically, from perspectives of peak probability and Shannon entropy of local knowledge, we design kernel-based knowledge refinement (KKR) and searching-based knowledge refinement (SKR) respectively, and theoretically guarantee that the refined-local knowledge can satisfy an approximately-similar distribution and be regarded as congruent. Extensive experiments conducted on three common datasets demonstrate that our proposed FedDKC significantly outperforms the state-of-the-art on various heterogeneous settings while evidently improving the convergence speed.  ( 3 min )
    Interpreting Embedding Spaces by Conceptualization. (arXiv:2209.00445v3 [cs.CL] UPDATED)
    One of the main methods for computational interpretation of a text is mapping it into a vector in some embedding space. Such vectors can then be used for a variety of textual processing tasks. Recently, most embedding spaces are a product of training large language models (LLMs). One major drawback of this type of representation is their incomprehensibility to humans. Understanding the embedding space is crucial for several important needs, including the need to debug the embedding method and compare it to alternatives, and the need to detect biases hidden in the model. In this paper, we present a novel method of understanding embeddings by transforming a latent embedding space into a comprehensible conceptual space. We present an algorithm for deriving a conceptual space with dynamic on-demand granularity. We devise a new evaluation method, using either human rater or LLM-based raters, to show that the conceptualized vectors indeed represent the semantics of the original latent ones. We show the use of our method for various tasks, including comparing the semantics of alternative models and tracing the layers of the LLM. The code is available online https://github.com/adiSimhi/Interpreting-Embedding-Spaces-by-Conceptualization.  ( 2 min )
    Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees. (arXiv:2305.01588v2 [cs.LG] UPDATED)
    Gradient clipping is a popular modification to standard (stochastic) gradient descent, at every iteration limiting the gradient norm to a certain value $c >0$. It is widely used for example for stabilizing the training of deep learning models (Goodfellow et al., 2016), or for enforcing differential privacy (Abadi et al., 2016). Despite popularity and simplicity of the clipping mechanism, its convergence guarantees often require specific values of $c$ and strong noise assumptions. In this paper, we give convergence guarantees that show precise dependence on arbitrary clipping thresholds $c$ and show that our guarantees are tight with both deterministic and stochastic gradients. In particular, we show that (i) for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence, (ii) in the stochastic setting convergence to the true optimum cannot be guaranteed under the standard noise assumption, even under arbitrary small step-sizes. We give matching upper and lower bounds for convergence of the gradient norm when running clipped SGD, and illustrate these results with experiments.  ( 2 min )
    Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models. (arXiv:2310.10378v4 [cs.CL] UPDATED)
    Multilingual large-scale Pretrained Language Models (PLMs) have been shown to store considerable amounts of factual knowledge, but large variations are observed across languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we propose a Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently from accuracy. Using this metric, we conduct an in-depth analysis of the determining factors for CLC, both at model level and at language-pair level. Among other results, we find that increasing model size leads to higher factual probing accuracy in most languages, but does not improve cross-lingual consistency. Finally, we conduct a case study on CLC when new factual associations are inserted in the PLMs via model editing. Results on a small sample of facts inserted in English reveal a clear pattern whereby the new piece of knowledge transfers only to languages with which English has a high RankC score.  ( 2 min )
    A Practical Approach to Novel Class Discovery in Tabular Data. (arXiv:2311.05440v1 [cs.LG])
    The problem of Novel Class Discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the $k$-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the essential elements necessary for the NCD problem and performs impressively well under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms ($k$-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments are conducted on 7 tabular datasets and demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.  ( 3 min )
    Diffusion-Generative Multi-Fidelity Learning for Physical Simulation. (arXiv:2311.05606v1 [cs.LG])
    Multi-fidelity surrogate learning is important for physical simulation related applications in that it avoids running numerical solvers from scratch, which is known to be costly, and it uses multi-fidelity examples for training and greatly reduces the cost of data collection. Despite the variety of existing methods, they all build a model to map the input parameters outright to the solution output. Inspired by the recent breakthrough in generative models, we take an alternative view and consider the solution output as generated from random noises. We develop a diffusion-generative multi-fidelity (DGMF) learning method based on stochastic differential equations (SDE), where the generation is a continuous denoising process. We propose a conditional score model to control the solution generation by the input parameters and the fidelity. By conditioning on additional inputs (temporal or spacial variables), our model can efficiently learn and predict multi-dimensional solution arrays. Our method naturally unifies discrete and continuous fidelity modeling. The advantage of our method in several typical applications shows a promising new direction for multi-fidelity learning.  ( 2 min )
    Difference Rewards Policy Gradients. (arXiv:2012.11258v2 [cs.MA] UPDATED)
    Policy gradient methods have become one of the most popular classes of algorithms for multi-agent reinforcement learning. A key challenge, however, that is not addressed by many of these methods is multi-agent credit assignment: assessing an agent's contribution to the overall performance, which is crucial for learning good policies. We propose a novel algorithm called Dr.Reinforce that explicitly tackles this by combining difference rewards with policy gradients to allow for learning decentralized policies when the reward function is known. By differencing the reward function directly, Dr.Reinforce avoids difficulties associated with learning the Q-function as done by Counterfactual Multiagent Policy Gradients (COMA), a state-of-the-art difference rewards method. For applications where the reward function is unknown, we show the effectiveness of a version of Dr.Reinforce that learns an additional reward network that is used to estimate the difference rewards.  ( 2 min )
    Efficient Activation Function Optimization through Surrogate Modeling. (arXiv:2301.05785v6 [cs.LG] UPDATED)
    Carefully designed activation functions can improve the performance of neural networks in many machine learning tasks. However, it is difficult for humans to construct optimal activation functions, and current activation function search algorithms are prohibitively expensive. This paper aims to improve the state of the art through three steps: First, the benchmark datasets Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT were created by training convolutional, residual, and vision transformer architectures from scratch with 2,913 systematically generated activation functions. Second, a characterization of the benchmark space was developed, leading to a new surrogate-based method for optimization. More specifically, the spectrum of the Fisher information matrix associated with the model's predictive distribution at initialization and the activation function's output distribution were found to be highly predictive of performance. Third, the surrogate was used to discover improved activation functions in several real-world tasks, with a surprising finding: a sigmoidal design that outperformed all other activation functions was discovered, challenging the status quo of always using rectifier nonlinearities in deep learning. Each of these steps is a contribution in its own right; together they serve as a practical and theoretical foundation for further research on activation function optimization.  ( 3 min )
    Implicit Variational Inference for High-Dimensional Posteriors. (arXiv:2310.06643v3 [cs.LG] UPDATED)
    In variational inference, the benefits of Bayesian models rely on accurately capturing the true posterior distribution. We propose using neural samplers that specify implicit distributions, which are well-suited for approximating complex multimodal and correlated posteriors in high-dimensional spaces. Our approach introduces novel bounds for approximate inference using implicit distributions by locally linearising the neural sampler. This is distinct from existing methods that rely on additional discriminator networks and unstable adversarial objectives. Furthermore, we present a new sampler architecture that, for the first time, enables implicit distributions over tens of millions of latent variables, addressing computational concerns by using differentiable numerical approximations. We empirically show that our method is capable of recovering correlations across layers in large Bayesian neural networks, a property that is crucial for a network's performance but notoriously challenging to achieve. To the best of our knowledge, no other method has been shown to accomplish this task for such large models. Through experiments in downstream tasks, we demonstrate that our expressive posteriors outperform state-of-the-art uncertainty quantification methods, validating the effectiveness of our training algorithm and the quality of the learned implicit approximation.  ( 2 min )
    Incorporating Neuro-Inspired Adaptability for Continual Learning in Artificial Intelligence. (arXiv:2308.14991v2 [cs.LG] UPDATED)
    Continual learning aims to empower artificial intelligence (AI) with strong adaptability to the real world. For this purpose, a desirable solution should properly balance memory stability with learning plasticity, and acquire sufficient compatibility to capture the observed distributions. Existing advances mainly focus on preserving memory stability to overcome catastrophic forgetting, but remain difficult to flexibly accommodate incremental changes as biological intelligence (BI) does. By modeling a robust Drosophila learning system that actively regulates forgetting with multiple learning modules, here we propose a generic approach that appropriately attenuates old memories in parameter distributions to improve learning plasticity, and accordingly coordinates a multi-learner architecture to ensure solution compatibility. Through extensive theoretical and empirical validation, our approach not only clearly enhances the performance of continual learning, especially over synaptic regularization methods in task-incremental settings, but also potentially advances the understanding of neurological adaptive mechanisms, serving as a novel paradigm to progress AI and BI together.  ( 2 min )
    ATMS: Algorithmic Trading-Guided Market Simulation. (arXiv:2309.01784v2 [cs.LG] UPDATED)
    In many applications involving multi-agent system (MAS), it is imperative to test an experimental (Exp) autonomous agent in a high-fidelity simulator prior to its deployment to production, to avoid unexpected losses in the real-world. Such a simulator acts as the environmental background (BG) agent(s), called agent-based simulator (ABS), aiming to replicate the complex real MAS. However, developing realistic ABS remains challenging, mainly due to the sequential and dynamic nature of such systems. To fill this gap, we propose a metric to distinguish between real and synthetic multi-agent systems, which is evaluated through the live interaction between the Exp and BG agents to explicitly account for the systems' sequential nature. Specifically, we characterize the system/environment by studying the effect of a sequence of BG agents' responses to the environment state evolution and take such effects' differences as MAS distance metric; The effect estimation is cast as a causal inference problem since the environment evolution is confounded with the previous environment state. Importantly, we propose the Interactive Agent-Guided Simulation (INTAGS) framework to build a realistic ABS by optimizing over this novel metric. To adapt to any environment with interactive sequential decision making agents, INTAGS formulates the simulator as a stochastic policy in reinforcement learning. Moreover, INTAGS utilizes the policy gradient update to bypass differentiating the proposed metric such that it can support non-differentiable operations of multi-agent environments. Through extensive experiments, we demonstrate the effectiveness of INTAGS on an equity stock market simulation example. We show that using INTAGS to calibrate the simulator can generate more realistic market data compared to the state-of-the-art conditional Wasserstein Generative Adversarial Network approach.  ( 3 min )
    TD Convergence: An Optimization Perspective. (arXiv:2306.17750v2 [cs.LG] UPDATED)
    We study the convergence behavior of the celebrated temporal-difference (TD) learning algorithm. By looking at the algorithm through the lens of optimization, we first argue that TD can be viewed as an iterative optimization algorithm where the function to be minimized changes per iteration. By carefully investigating the divergence displayed by TD on a classical counter example, we identify two forces that determine the convergent or divergent behavior of the algorithm. We next formalize our discovery in the linear TD setting with quadratic loss and prove that convergence of TD hinges on the interplay between these two forces. We extend this optimization perspective to prove convergence of TD in a much broader setting than just linear approximation and squared loss. Our results provide a theoretical explanation for the successful application of TD in reinforcement learning.  ( 2 min )
    The Sample Complexity Of ERMs In Stochastic Convex Optimization. (arXiv:2311.05398v1 [cs.LG])
    Stochastic convex optimization is one of the most well-studied models for learning in modern machine learning. Nevertheless, a central fundamental question in this setup remained unresolved: "How many data points must be observed so that any empirical risk minimizer (ERM) shows good performance on the true population?" This question was proposed by Feldman (2016), who proved that $\Omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are necessary (where $d$ is the dimension and $\epsilon>0$ is the accuracy parameter). Proving an $\omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ lower bound was left as an open problem. In this work we show that in fact $\tilde{O}(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are also sufficient. This settles the question and yields a new separation between ERMs and uniform convergence. This sample complexity holds for the classical setup of learning bounded convex Lipschitz functions over the Euclidean unit ball. We further generalize the result and show that a similar upper bound holds for all symmetric convex bodies. The general bound is composed of two terms: (i) a term of the form $\tilde{O}(\frac{d}{\epsilon})$ with an inverse-linear dependence on the accuracy parameter, and (ii) a term that depends on the statistical complexity of the class of $\textit{linear}$ functions (captured by the Rademacher complexity). The proof builds a mechanism for controlling the behavior of stochastic convex optimization problems.  ( 2 min )
    Training neural operators to preserve invariant measures of chaotic attractors. (arXiv:2306.01187v2 [cs.LG] UPDATED)
    Chaotic systems make long-horizon forecasts difficult because small perturbations in initial conditions cause trajectories to diverge at an exponential rate. In this setting, neural operators trained to minimize squared error losses, while capable of accurate short-term forecasts, often fail to reproduce statistical or structural properties of the dynamics over longer time horizons and can yield degenerate results. In this paper, we propose an alternative framework designed to preserve invariant measures of chaotic attractors that characterize the time-invariant statistical properties of the dynamics. Specifically, in the multi-environment setting (where each sample trajectory is governed by slightly different dynamics), we consider two novel approaches to training with noisy data. First, we propose a loss based on the optimal transport distance between the observed dynamics and the neural operator outputs. This approach requires expert knowledge of the underlying physics to determine what statistical features should be included in the optimal transport loss. Second, we show that a contrastive learning framework, which does not require any specialized prior knowledge, can preserve statistical properties of the dynamics nearly as well as the optimal transport approach. On a variety of chaotic systems, our method is shown empirically to preserve invariant measures of chaotic attractors.
    FDAPT: Federated Domain-adaptive Pre-training for Language Models. (arXiv:2307.06933v2 [cs.LG] UPDATED)
    Foundation models (FMs) have shown prominent success in a wide range of tasks. Their applicability to specific domain-task pairings relies on the availability of, both, high-quality data and significant computational resources. These challenges are not new to the field and, indeed, Federated Learning (FL) has been shown to be a promising solution in similar setups. This paper tackles the specific case of Domain-Adaptive Pre-Training (DAPT), a key step in the application of FMs. We conduct the first comprehensive empirical study to evaluate the performance of Federated Domain-Adaptive Pre-Training (FDAPT). We demonstrate that FDAPT can maintain competitive downstream task performance to the centralized baseline in both IID and non-IID situations. Finally, we propose a novel algorithm, Frozen Federated Domain-Adaptive Pre-Training (FFDAPT). FFDAPT improves the computational efficiency by 12.1% on average and exhibits similar downstream task performance to vanilla FDAPT, with general performance fluctuations remaining less than 1%.
    Generalization in medical AI: a perspective on developing scalable models. (arXiv:2311.05418v1 [cs.LG])
    Over the past few years, research has witnessed the advancement of deep learning models trained on large datasets, some even encompassing millions of examples. While these impressive performance on their hidden test sets, they often underperform when assessed on external datasets. Recognizing the critical role of generalization in medical AI development, many prestigious journals now require reporting results both on the local hidden test set as well as on external datasets before considering a study for publication. Effectively, the field of medical AI has transitioned from the traditional usage of a single dataset that is split into train and test to a more comprehensive framework using multiple datasets, some of which are used for model development (source domain) and others for testing (target domains). However, this new experimental setting does not necessarily resolve the challenge of generalization. This is because of the variability encountered in intended use and specificities across hospital cultures making the idea of universally generalizable systems a myth. On the other hand, the systematic, and a fortiori recurrent re-calibration, of models at the individual hospital level, although ideal, may be overoptimistic given the legal, regulatory and technical challenges that are involved. Re-calibration using transfer learning may not even be possible in some instances where reference labels of target domains are not available. In this perspective we establish a hierarchical three-level scale system reflecting the generalization level of a medical AI algorithm. This scale better reflects the diversity of real-world medical scenarios per which target domain data for re-calibration of models may or not be available and if it is, may or not have reference labels systematically available.
    Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. (arXiv:2306.12929v2 [cs.LG] UPDATED)
    Transformer models have been widely adopted in various domains over the last years, and especially large language models have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase in necessary compute. Quantization is one of the most effective ways to reduce the computational time and memory consumption of neural networks. Many studies have shown, however, that modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. To retain acceptable performance, the existence of these outliers requires activations to be in higher bitwidth or the use of different numeric formats, extra fine-tuning, or other workarounds. We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual. To achieve the exact zeros needed in the attention matrix for a no-update, the input to the softmax is pushed to be larger and larger during training, causing outliers in other parts of the network. Based on these observations, we propose two simple (independent) modifications to the attention mechanism - clipped softmax and gated attention. We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining and sometimes even improving the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers.  ( 3 min )
    SigScatNet: A Siamese + Scattering based Deep Learning Approach for Signature Forgery Detection and Similarity Assessment. (arXiv:2311.05579v1 [cs.CV])
    The surge in counterfeit signatures has inflicted widespread inconveniences and formidable challenges for both individuals and organizations. This groundbreaking research paper introduces SigScatNet, an innovative solution to combat this issue by harnessing the potential of a Siamese deep learning network, bolstered by Scattering wavelets, to detect signature forgery and assess signature similarity. The Siamese Network empowers us to ascertain the authenticity of signatures through a comprehensive similarity index, enabling precise validation and comparison. Remarkably, the integration of Scattering wavelets endows our model with exceptional efficiency, rendering it light enough to operate seamlessly on cost-effective hardware systems. To validate the efficacy of our approach, extensive experimentation was conducted on two open-sourced datasets: the ICDAR SigComp Dutch dataset and the CEDAR dataset. The experimental results demonstrate the practicality and resounding success of our proposed SigScatNet, yielding an unparalleled Equal Error Rate of 3.689% with the ICDAR SigComp Dutch dataset and an astonishing 0.0578% with the CEDAR dataset. Through the implementation of SigScatNet, our research spearheads a new state-of-the-art in signature analysis in terms of EER scores and computational efficiency, offering an advanced and accessible solution for detecting forgery and quantifying signature similarities. By employing cutting-edge Siamese deep learning and Scattering wavelets, we provide a robust framework that paves the way for secure and efficient signature verification systems.
    Geometry-Calibrated DRO: Combating Over-Pessimism with Free Energy Implications. (arXiv:2311.05054v1 [cs.LG])
    Machine learning algorithms minimizing average risk are susceptible to distributional shifts. Distributionally Robust Optimization (DRO) addresses this issue by optimizing the worst-case risk within an uncertainty set. However, DRO suffers from over-pessimism, leading to low-confidence predictions, poor parameter estimations as well as poor generalization. In this work, we conduct a theoretical analysis of a probable root cause of over-pessimism: excessive focus on noisy samples. To alleviate the impact of noise, we incorporate data geometry into calibration terms in DRO, resulting in our novel Geometry-Calibrated DRO (GCDRO) for regression. We establish the connection between our risk objective and the Helmholtz free energy in statistical physics, and this free-energy-based risk can extend to standard DRO methods. Leveraging gradient flow in Wasserstein space, we develop an approximate minimax optimization algorithm with a bounded error ratio and elucidate how our approach mitigates noisy sample effects. Comprehensive experiments confirm GCDRO's superiority over conventional DRO methods.
    MLRegTest: A Benchmark for the Machine Learning of Regular Languages. (arXiv:2304.07687v2 [cs.LG] UPDATED)
    Evaluating machine learning (ML) systems on their ability to learn known classifiers allows fine-grained examination of the patterns they can learn, which builds confidence when they are applied to the learning of unknown classifiers. This article presents a new benchmark for ML systems on sequence classification called MLRegTest, which contains training, development, and test sets from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known challenge for ML systems to generalize successfully. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provides a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies. Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that their performance depends significantly on the kind of test set, the class of language, and the neural network architecture.
    Contrastive Training of Complex-Valued Autoencoders for Object Discovery. (arXiv:2305.15001v3 [cs.LG] UPDATED)
    Current state-of-the-art object-centric models use slots and attention-based routing for binding. However, this class of models has several conceptual limitations: the number of slots is hardwired; all slots have equal capacity; training has high computational cost; there are no object-level relational factors within slots. Synchrony-based models in principle can address these limitations by using complex-valued activations which store binding information in their phase components. However, working examples of such synchrony-based models have been developed only very recently, and are still limited to toy grayscale datasets and simultaneous storage of less than three objects in practice. Here we introduce architectural modifications and a novel contrastive learning method that greatly improve the state-of-the-art synchrony-based model. For the first time, we obtain a class of synchrony-based models capable of discovering objects in an unsupervised manner in multi-object color datasets and simultaneously representing more than three objects.
    The training accuracy of two-layer neural networks: its estimation and understanding using random datasets. (arXiv:2010.13380v2 [cs.LG] UPDATED)
    Although the neural network (NN) technique plays an important role in machine learning, understanding the mechanism of NN models and the transparency of deep learning still require more basic research. In this study, we propose a novel theory based on space partitioning to estimate the approximate training accuracy for two-layer neural networks on random datasets without training. There appear to be no other studies that have proposed a method to estimate training accuracy without using input data and/or trained models. Our method estimates the training accuracy for two-layer fully-connected neural networks on two-class random datasets using only three arguments: the dimensionality of inputs (d), the number of inputs (N), and the number of neurons in the hidden layer (L). We have verified our method using real training accuracies in our experiments. The results indicate that the method will work for any dimension, and the proposed theory could extend also to estimate deeper NN models. The main purpose of this paper is to understand the mechanism of NN models by the approach of estimating training accuracy but not to analyze their generalization nor their performance in real-world applications. This study may provide a starting point for a new way for researchers to make progress on the difficult problem of understanding deep learning.
    Prediction-Powered Inference. (arXiv:2301.09633v4 [stat.ML] UPDATED)
    Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system. The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients, without making any assumptions on the machine-learning algorithm that supplies the predictions. Furthermore, more accurate predictions translate to smaller confidence intervals. Prediction-powered inference could enable researchers to draw valid and more data-efficient conclusions using machine learning. The benefits of prediction-powered inference are demonstrated with datasets from proteomics, astronomy, genomics, remote sensing, census analysis, and ecology.  ( 2 min )
    AdjointBackMapV2: Precise Reconstruction of Arbitrary CNN Unit's Activation via Adjoint Operators. (arXiv:2110.01736v2 [cs.LG] UPDATED)
    Adjoint operators have been found to be effective in the exploration of CNN's inner workings [1]. However, the previous no-bias assumption restricted its generalization. We overcome the restriction via embedding input images into an extended normed space that includes bias in all CNN layers as part of the extended space and propose an adjoint-operator-based algorithm that maps high-level weights back to the extended input space for reconstructing an effective hypersurface. Such hypersurface can be computed for an arbitrary unit in the CNN, and we prove that this reconstructed hypersurface, when multiplied by the original input (through an inner product), will precisely replicate the output value of each unit. We show experimental results based on the CIFAR-10 and CIFAR-100 data sets where the proposed approach achieves near 0 activation value reconstruction error.  ( 2 min )
    Principled Pruning of Bayesian Neural Networks through Variational Free Energy Minimization. (arXiv:2210.09134v3 [cs.LG] UPDATED)
    Bayesian model reduction provides an efficient approach for comparing the performance of all nested sub-models of a model, without re-evaluating any of these sub-models. Until now, Bayesian model reduction has been applied mainly in the computational neuroscience community on simple models. In this paper, we formulate and apply Bayesian model reduction to perform principled pruning of Bayesian neural networks, based on variational free energy minimization. Direct application of Bayesian model reduction, however, gives rise to approximation errors. Therefore, a novel iterative pruning algorithm is presented to alleviate the problems arising with naive Bayesian model reduction, as supported experimentally on the publicly available UCI datasets for different inference algorithms. This novel parameter pruning scheme solves the shortcomings of current state-of-the-art pruning methods that are used by the signal processing community. The proposed approach has a clear stopping criterion and minimizes the same objective that is used during training. Next to these benefits, our experiments indicate better model performance in comparison to state-of-the-art pruning schemes.
    Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models. (arXiv:2310.09949v2 [cs.LG] UPDATED)
    A Retrieval-Augmented Language Model (RALM) augments a generative language model by retrieving context-specific knowledge from an external database. This strategy facilitates impressive text generation quality even with smaller models, thus reducing orders of magnitude of computational demands. However, RALMs introduce unique system design challenges due to (a) the diverse workload characteristics between LM inference and retrieval and (b) the various system requirements and bottlenecks for different RALM configurations such as model sizes, database sizes, and retrieval frequencies. We propose Chameleon, a heterogeneous accelerator system that integrates both LM and retrieval accelerators in a disaggregated architecture. The heterogeneity ensures efficient acceleration of both LM inference and retrieval, while the accelerator disaggregation enables the system to independently scale both types of accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements retrieval accelerators on FPGAs and assigns LM inference to GPUs, with a CPU server orchestrating these accelerators over the network. Compared to CPU-based and CPU-GPU vector search systems, Chameleon achieves up to 23.72x speedup and 26.2x energy efficiency. Evaluated on various RALMs, Chameleon exhibits up to 2.16x reduction in latency and 3.18x speedup in throughput compared to the hybrid CPU-GPU architecture. These promising results pave the way for bringing accelerator heterogeneity and disaggregation into future RALM systems.
    Fair Wasserstein Coresets. (arXiv:2311.05436v1 [stat.ML])
    Recent technological advancements have given rise to the ability of collecting vast amounts of data, that often exceed the capacity of commonly used machine learning algorithms. Approaches such as coresets and synthetic data distillation have emerged as frameworks to generate a smaller, yet representative, set of samples for downstream training. As machine learning is increasingly applied to decision-making processes, it becomes imperative for modelers to consider and address biases in the data concerning subgroups defined by factors like race, gender, or other sensitive attributes. Current approaches focus on creating fair synthetic representative samples by optimizing local properties relative to the original samples. These methods, however, are not guaranteed to positively affect the performance or fairness of downstream learning processes. In this work, we present Fair Wasserstein Coresets (FWC), a novel coreset approach which generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. FWC aims to minimize the Wasserstein distance between the original datasets and the weighted synthetic samples while enforcing (an empirical version of) demographic parity, a prominent criterion for algorithmic fairness, via a linear constraint. We show that FWC can be thought of as a constrained version of Lloyd's algorithm for k-medians or k-means clustering. Our experiments, conducted on both synthetic and real datasets, demonstrate the scalability of our approach and highlight the competitive performance of FWC compared to existing fair clustering approaches, even when attempting to enhance the fairness of the latter through fair pre-processing techniques.
    Explainable Identification of Hate Speech towards Islam using Graph Neural Networks. (arXiv:2311.04916v1 [cs.CL])
    Islamophobic language is a prevalent challenge on online social interaction platforms. Identifying and eliminating such hatred is a crucial step towards a future of harmony and peace. This study presents a novel paradigm for identifying and explaining hate speech towards Islam using graph neural networks. Utilizing the intrinsic ability of graph neural networks to find, extract, and use relationships across disparate data points, our model consistently achieves outstanding performance while offering explanations for the underlying correlations and causation.
    LCM-LoRA: A Universal Stable-Diffusion Acceleration Module. (arXiv:2311.05556v1 [cs.CV])
    Latent Consistency Models (LCMs) have achieved impressive performance in accelerating text-to-image generative tasks, producing high-quality images with minimal inference steps. LCMs are distilled from pre-trained latent diffusion models (LDMs), requiring only ~32 A100 GPU training hours. This report further extends LCMs' potential in two aspects: First, by applying LoRA distillation to Stable-Diffusion models including SD-V1.5, SSD-1B, and SDXL, we have expanded LCM's scope to larger models with significantly less memory consumption, achieving superior image generation quality. Second, we identify the LoRA parameters obtained through LCM distillation as a universal Stable-Diffusion acceleration module, named LCM-LoRA. LCM-LoRA can be directly plugged into various Stable-Diffusion fine-tuned models or LoRAs without training, thus representing a universally applicable accelerator for diverse image generation tasks. Compared with previous numerical PF-ODE solvers such as DDIM, DPM-Solver, LCM-LoRA can be viewed as a plug-in neural PF-ODE solver that possesses strong generalization abilities. Project page: https://github.com/luosiallen/latent-consistency-model.  ( 2 min )
    Differentiable Random Partition Models. (arXiv:2305.16841v2 [cs.LG] UPDATED)
    Partitioning a set of elements into an unknown number of mutually exclusive subsets is essential in many machine learning problems. However, assigning elements, such as samples in a dataset or neurons in a network layer, to an unknown and discrete number of subsets is inherently non-differentiable, prohibiting end-to-end gradient-based optimization of parameters. We overcome this limitation by proposing a novel two-step method for inferring partitions, which allows its usage in variational inference tasks. This new approach enables reparameterized gradients with respect to the parameters of the new random partition model. Our method works by inferring the number of elements per subset and, second, by filling these subsets in a learned order. We highlight the versatility of our general-purpose approach on three different challenging experiments: variational clustering, inference of shared and independent generative factors under weak supervision, and multitask learning.
    Topological Learning in Multi-Class Data Sets. (arXiv:2301.09734v2 [cs.LG] UPDATED)
    We specialize techniques from topological data analysis to the problem of characterizing the topological complexity (as defined in the body of the paper) of a multi-class data set. As a by-product, a topological classifier is defined that uses an open sub-covering of the data set. This sub-covering can be used to construct a simplicial complex whose topological features (e.g., Betti numbers) provide information about the classification problem. We use these topological constructs to study the impact of topological complexity on learning in feedforward deep neural networks (DNNs). We hypothesize that topological complexity is negatively correlated with the ability of a fully connected feedforward deep neural network to learn to classify data correctly. We evaluate our topological classification algorithm on multiple constructed and open source data sets. We also validate our hypothesis regarding the relationship between topological complexity and learning in DNN's on multiple data sets.
    Why Batch Normalization Damage Federated Learning on Non-IID Data?. (arXiv:2301.02982v3 [cs.LG] UPDATED)
    As a promising distributed learning paradigm, federated learning (FL) involves training deep neural network (DNN) models at the network edge while protecting the privacy of the edge clients. To train a large-scale DNN model, batch normalization (BN) has been regarded as a simple and effective means to accelerate the training and improve the generalization capability. However, recent findings indicate that BN can significantly impair the performance of FL in the presence of non-i.i.d. data. While several FL algorithms have been proposed to address this issue, their performance still falls significantly when compared to the centralized scheme. Furthermore, none of them have provided a theoretical explanation of how the BN damages the FL convergence. In this paper, we present the first convergence analysis to show that under the non-i.i.d. data, the mismatch between the local and global statistical parameters in BN causes the gradient deviation between the local and global models, which, as a result, slows down and biases the FL convergence. In view of this, we develop a new FL algorithm that is tailored to BN, called FedTAN, which is capable of achieving robust FL performance under a variety of data distributions via iterative layer-wise parameter aggregation. Comprehensive experimental results demonstrate the superiority of the proposed FedTAN over existing baselines for training BN-based DNN models.
    Reducing the Side-Effects of Oscillations in Training of Quantized YOLO Networks. (arXiv:2311.05109v1 [cs.CV])
    Quantized networks use less computational and memory resources and are suitable for deployment on edge devices. While quantization-aware training QAT is the well-studied approach to quantize the networks at low precision, most research focuses on over-parameterized networks for classification with limited studies on popular and edge device friendly single-shot object detection and semantic segmentation methods like YOLO. Moreover, majority of QAT methods rely on Straight-through Estimator (STE) approximation which suffers from an oscillation phenomenon resulting in sub-optimal network quantization. In this paper, we show that it is difficult to achieve extremely low precision (4-bit and lower) for efficient YOLO models even with SOTA QAT methods due to oscillation issue and existing methods to overcome this problem are not effective on these models. To mitigate the effect of oscillation, we first propose Exponentially Moving Average (EMA) based update to the QAT model. Further, we propose a simple QAT correction method, namely QC, that takes only a single epoch of training after standard QAT procedure to correct the error induced by oscillating weights and activations resulting in a more accurate quantized model. With extensive evaluation on COCO dataset using various YOLO5 and YOLO7 variants, we show that our correction method improves quantized YOLO networks consistently on both object detection and segmentation tasks at low-precision (4-bit and 3-bit).  ( 2 min )
    Spatial Attention-based Distribution Integration Network for Human Pose Estimation. (arXiv:2311.05323v1 [cs.CV])
    In recent years, human pose estimation has made significant progress through the implementation of deep learning techniques. However, these techniques still face limitations when confronted with challenging scenarios, including occlusion, diverse appearances, variations in illumination, and overlap. To cope with such drawbacks, we present the Spatial Attention-based Distribution Integration Network (SADI-NET) to improve the accuracy of localization in such situations. Our network consists of three efficient models: the receptive fortified module (RFM), spatial fusion module (SFM), and distribution learning module (DLM). Building upon the classic HourglassNet architecture, we replace the basic block with our proposed RFM. The RFM incorporates a dilated residual block and attention mechanism to expand receptive fields while enhancing sensitivity to spatial information. In addition, the SFM incorporates multi-scale characteristics by employing both global and local attention mechanisms. Furthermore, the DLM, inspired by residual log-likelihood estimation (RLE), reconfigures a predicted heatmap using a trainable distribution weight. For the purpose of determining the efficacy of our model, we conducted extensive experiments on the MPII and LSP benchmarks. Particularly, our model obtained a remarkable $92.10\%$ percent accuracy on the MPII test dataset, demonstrating significant improvements over existing models and establishing state-of-the-art performance.
    Progressive Ensemble Distillation: Building Ensembles for Efficient Inference. (arXiv:2302.10093v2 [cs.LG] UPDATED)
    We study the problem of progressive ensemble distillation: Given a large, pretrained teacher model $g$, we seek to decompose the model into smaller, low-inference cost student models $f_i$, such that progressively evaluating additional models in this ensemble leads to improved predictions. The resulting ensemble allows for flexibly tuning accuracy vs. inference cost at runtime, which is useful for a number of applications in on-device inference. The method we propose, B-DISTIL , relies on an algorithmic procedure that uses function composition over intermediate activations to construct expressive ensembles with similar performance as $g$ , but with smaller student models. We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across standard image, speech, and sensor datasets. We also provide theoretical guarantees in terms of convergence and generalization.
    Parkinson's Disease Detection through Vocal Biomarkers and Advanced Machine Learning Algorithms: A Comprehensive Study. (arXiv:2311.05435v1 [cs.LG])
    Parkinson's disease (PD) is a prevalent neurodegenerative disorder known for its impact on motor neurons, causing symptoms like tremors, stiffness, and gait difficulties. This study explores the potential of vocal feature alterations in PD patients as a means of early disease prediction. This research aims to predict the onset of Parkinson's disease. Utilizing a variety of advanced machine-learning algorithms, including XGBoost, LightGBM, Bagging, AdaBoost, and Support Vector Machine, among others, the study evaluates the predictive performance of these models using metrics such as accuracy, area under the curve (AUC), sensitivity, and specificity. The findings of this comprehensive analysis highlight LightGBM as the most effective model, achieving an impressive accuracy rate of 96%, alongside a matching AUC of 96%. LightGBM exhibited a remarkable sensitivity of 100% and specificity of 94.43%, surpassing other machine learning algorithms in accuracy and AUC scores. Given the complexities of Parkinson's disease and its challenges in early diagnosis, this study underscores the significance of leveraging vocal biomarkers coupled with advanced machine-learning techniques for precise and timely PD detection.
    Joint Feature and Differentiable $ k $-NN Graph Learning using Dirichlet Energy. (arXiv:2305.12396v2 [cs.LG] UPDATED)
    Feature selection (FS) plays an important role in machine learning, which extracts important features and accelerates the learning process. In this paper, we propose a deep FS method that simultaneously conducts feature selection and differentiable $ k $-NN graph learning based on the Dirichlet Energy. The Dirichlet Energy identifies important features by measuring their smoothness on the graph structure, and facilitates the learning of a new graph that reflects the inherent structure in new feature subspace. We employ Optimal Transport theory to address the non-differentiability issue of learning $ k $-NN graphs in neural networks, which theoretically makes our method applicable to other graph neural networks for dynamic graph learning. Furthermore, the proposed framework is interpretable, since all modules are designed algorithmically. We validate the effectiveness of our model with extensive experiments on both synthetic and real-world datasets.
    MathNAS: If Blocks Have a Role in Mathematical Architecture Design. (arXiv:2311.04943v1 [cs.LG])
    Neural Architecture Search (NAS) has emerged as a favoured method for unearthing effective neural architectures. Recent development of large models has intensified the demand for faster search speeds and more accurate search results. However, designing large models by NAS is challenging due to the dramatical increase of search space and the associated huge performance evaluation cost. Consider a typical modular search space widely used in NAS, in which a neural architecture consists of $m$ block nodes and a block node has $n$ alternative blocks. Facing the space containing $n^m$ candidate networks, existing NAS methods attempt to find the best one by searching and evaluating candidate networks directly.Different from the general strategy that takes architecture search as a whole problem, we propose a novel divide-and-conquer strategy by making use of the modular nature of the search space.Here, we introduce MathNAS, a general NAS framework based on mathematical programming.In MathNAS, the performances of the $m*n$ possible building blocks in the search space are calculated first, and then the performance of a network is directly predicted based on the performances of its building blocks. Although estimating block performances involves network training, just as what happens for network performance evaluation in existing NAS methods, predicting network performance is completely training-free and thus extremely fast. In contrast to the $n^m$ candidate networks to evaluate in existing NAS methods, which require training and a formidable computational burden, there are only $m*n$ possible blocks to handle in MathNAS. Therefore, our approach effectively reduces the complexity of network performance evaluation.Our code is available at https://github.com/wangqinsi1/MathNAS.
    Top-Tuning: a study on transfer learning for an efficient alternative to fine tuning for image classification with fast kernel methods. (arXiv:2209.07932v3 [cs.LG] UPDATED)
    The impressive performance of deep learning architectures is associated with a massive increase in model complexity. Millions of parameters need to be tuned, with training and inference time scaling accordingly, together with energy consumption. But is massive fine-tuning always necessary? In this paper, focusing on image classification, we consider a simple transfer learning approach exploiting pre-trained convolutional features as input for a fast-to-train kernel method. We refer to this approach as \textit{top-tuning} since only the kernel classifier is trained on the target dataset. In our study, we perform more than 3000 training processes focusing on 32 small to medium-sized target datasets, a typical situation where transfer learning is necessary. We show that the top-tuning approach provides comparable accuracy with respect to fine-tuning, with a training time between one and two orders of magnitude smaller. These results suggest that top-tuning is an effective alternative to fine-tuning in small/medium datasets, being especially useful when training time efficiency and computational resources saving are crucial.
    Straggler-Resilient Differentially-Private Decentralized Learning. (arXiv:2212.03080v2 [cs.LG] UPDATED)
    We consider the straggler problem in decentralized learning over a logical ring while preserving user data privacy. Especially, we extend the recently proposed framework of differential privacy (DP) amplification by decentralization by Cyffers and Bellet to include overall training latency--comprising both computation and communication latency. Analytical results on both the convergence speed and the DP level are derived for both a skipping scheme (which ignores the stragglers after a timeout) and a baseline scheme that waits for each node to finish before the training continues. A trade-off between overall training latency, accuracy, and privacy, parameterized by the timeout of the skipping scheme, is identified and empirically validated for logistic regression on a real-world dataset and for image classification using the MNIST and CIFAR-10 datasets.  ( 2 min )
    A Deep Learning Method for Simultaneous Denoising and Missing Wedge Reconstruction in Cryogenic Electron Tomography. (arXiv:2311.05539v1 [cs.CV])
    Cryogenic electron tomography (cryo-ET) is a technique for imaging biological samples such as viruses, cells, and proteins in 3D. A microscope collects a series of 2D projections of the sample, and the goal is to reconstruct the 3D density of the sample called the tomogram. This is difficult as the 2D projections have a missing wedge of information and are noisy. Tomograms reconstructed with conventional methods, such as filtered back-projection, suffer from the noise, and from artifacts and anisotropic resolution due to the missing wedge of information. To improve the visual quality and resolution of such tomograms, we propose a deep-learning approach for simultaneous denoising and missing wedge reconstruction called DeepDeWedge. DeepDeWedge is based on fitting a neural network to the 2D projections with a self-supervised loss inspired by noise2noise-like methods. The algorithm requires no training or ground truth data. Experiments on synthetic and real cryo-ET data show that DeepDeWedge achieves competitive performance for deep learning-based denoising and missing wedge reconstruction of cryo-ET tomograms.
    Facing Off World Model Backbones: RNNs, Transformers, and S4. (arXiv:2307.02064v2 [cs.LG] UPDATED)
    World models are a fundamental component in model-based reinforcement learning (MBRL). To perform temporally extended and consistent simulations of the future in partially observable environments, world models need to possess long-term memory. However, state-of-the-art MBRL agents, such as Dreamer, predominantly employ recurrent neural networks (RNNs) as their world model backbone, which have limited memory capacity. In this paper, we seek to explore alternative world model backbones for improving long-term memory. In particular, we investigate the effectiveness of Transformers and Structured State Space Sequence (S4) models, motivated by their remarkable ability to capture long-range dependencies in low-dimensional sequences and their complementary strengths. We propose S4WM, the first world model compatible with parallelizable SSMs including S4 and its variants. By incorporating latent variable modeling, S4WM can efficiently generate high-dimensional image sequences through latent imagination. Furthermore, we extensively compare RNN-, Transformer-, and S4-based world models across four sets of environments, which we have tailored to assess crucial memory capabilities of world models, including long-term imagination, context-dependent recall, reward prediction, and memory-based reasoning. Our findings demonstrate that S4WM outperforms Transformer-based world models in terms of long-term memory, while exhibiting greater efficiency during training and imagination. These results pave the way for the development of stronger MBRL agents.
    Learning to Group Auxiliary Datasets for Molecule. (arXiv:2307.04052v2 [q-bio.BM] UPDATED)
    The limited availability of annotations in small molecule datasets presents a challenge to machine learning models. To address this, one common strategy is to collaborate with additional auxiliary datasets. However, having more data does not always guarantee improvements. Negative transfer can occur when the knowledge in the target dataset differs or contradicts that of the auxiliary molecule datasets. In light of this, identifying the auxiliary molecule datasets that can benefit the target dataset when jointly trained remains a critical and unresolved problem. Through an empirical analysis, we observe that combining graph structure similarity and task similarity can serve as a more reliable indicator for identifying high-affinity auxiliary datasets. Motivated by this insight, we propose MolGroup, which separates the dataset affinity into task and structure affinity to predict the potential benefits of each auxiliary molecule dataset. MolGroup achieves this by utilizing a routing mechanism optimized through a bi-level optimization framework. Empowered by the meta gradient, the routing mechanism is optimized toward maximizing the target dataset's performance and quantifies the affinity as the gating score. As a result, MolGroup is capable of predicting the optimal combination of auxiliary datasets for each target dataset. Our extensive experiments demonstrate the efficiency and effectiveness of MolGroup, showing an average improvement of 4.41%/3.47% for GIN/Graphormer trained with the group of molecule datasets selected by MolGroup on 11 target molecule datasets.
    Explained anomaly detection in text reviews: Can subjective scenarios be correctly evaluated?. (arXiv:2311.04948v1 [cs.CL])
    This paper presents a pipeline to detect and explain anomalous reviews in online platforms. The pipeline is made up of three modules and allows the detection of reviews that do not generate value for users due to either worthless or malicious composition. The classifications are accompanied by a normality score and an explanation that justifies the decision made. The pipeline's ability to solve the anomaly detection task was evaluated using different datasets created from a large Amazon database. Additionally, a study comparing three explainability techniques involving 241 participants was conducted to assess the explainability module. The study aimed to measure the impact of explanations on the respondents' ability to reproduce the classification model and their perceived usefulness. This work can be useful to automate tasks in review online platforms, such as those for electronic commerce, and offers inspiration for addressing similar problems in the field of anomaly detection in textual data. We also consider it interesting to have carried out a human evaluation of the capacity of different explainability techniques in a real and infrequent scenario such as the detection of anomalous reviews, as well as to reflect on whether it is possible to explain tasks as humanly subjective as this one.
    Disentangling Quantum and Classical Contributions in Hybrid Quantum Machine Learning Architectures. (arXiv:2311.05559v1 [quant-ph])
    Quantum computing offers the potential for superior computational capabilities, particularly for data-intensive tasks. However, the current state of quantum hardware puts heavy restrictions on input size. To address this, hybrid transfer learning solutions have been developed, merging pre-trained classical models, capable of handling extensive inputs, with variational quantum circuits. Yet, it remains unclear how much each component - classical and quantum - contributes to the model's results. We propose a novel hybrid architecture: instead of utilizing a pre-trained network for compression, we employ an autoencoder to derive a compressed version of the input data. This compressed data is then channeled through the encoder part of the autoencoder to the quantum component. We assess our model's classification capabilities against two state-of-the-art hybrid transfer learning architectures, two purely classical architectures and one quantum architecture. Their accuracy is compared across four datasets: Banknote Authentication, Breast Cancer Wisconsin, MNIST digits, and AudioMNIST. Our research suggests that classical components significantly influence classification in hybrid transfer learning, a contribution often mistakenly ascribed to the quantum element. The performance of our model aligns with that of a variational quantum circuit using amplitude embedding, positioning it as a feasible alternative.
    Exploring Emotion Expression Recognition in Older Adults Interacting with a Virtual Coach. (arXiv:2311.05567v1 [cs.CV])
    The EMPATHIC project aimed to design an emotionally expressive virtual coach capable of engaging healthy seniors to improve well-being and promote independent aging. One of the core aspects of the system is its human sensing capabilities, allowing for the perception of emotional states to provide a personalized experience. This paper outlines the development of the emotion expression recognition module of the virtual coach, encompassing data collection, annotation design, and a first methodological approach, all tailored to the project requirements. With the latter, we investigate the role of various modalities, individually and combined, for discrete emotion expression recognition in this context: speech from audio, and facial expressions, gaze, and head dynamics from video. The collected corpus includes users from Spain, France, and Norway, and was annotated separately for the audio and video channels with distinct emotional labels, allowing for a performance comparison across cultures and label types. Results confirm the informative power of the modalities studied for the emotional categories considered, with multimodal methods generally outperforming others (around 68% accuracy with audio labels and 72-74% with video labels). The findings are expected to contribute to the limited literature on emotion recognition applied to older adults in conversational human-machine interaction.  ( 3 min )
    Understanding Addition in Transformers. (arXiv:2310.13121v3 [cs.LG] UPDATED)
    Understanding the inner workings of machine learning models like Transformers is vital for their safe and ethical use. This paper presents an in-depth analysis of a one-layer Transformer model trained for n-digit integer addition. We reveal that the model divides the task into parallel, digit-specific streams and employs distinct algorithms for different digit positions. Our study also finds that the model starts calculations late but executes them rapidly. A rare use case with high loss is identified and explained. Overall, the model's algorithm is explained in detail. These findings are validated through rigorous testing and mathematical modeling, contributing to the broader works in Mechanistic Interpretability, AI safety, and alignment. Our approach opens the door for analyzing more complex tasks and multi-layer Transformer models.
    Lightweight Diffusion Models with Distillation-Based Block Neural Architecture Search. (arXiv:2311.04950v1 [cs.CV])
    Diffusion models have recently shown remarkable generation ability, achieving state-of-the-art performance in many tasks. However, the high computational cost is still a troubling problem for diffusion models. To tackle this problem, we propose to automatically remove the structural redundancy in diffusion models with our proposed Diffusion Distillation-based Block-wise Neural Architecture Search (DiffNAS). Specifically, given a larger pretrained teacher, we leverage DiffNAS to search for the smallest architecture which achieves on-par or even better performance than the teacher. Considering current diffusion models are based on UNet which naturally has a block-wise structure, we perform neural architecture search independently in each block, which largely reduces the search space. Different from previous block-wise NAS methods, DiffNAS contains a block-wise local search strategy and a retraining strategy with a joint dynamic loss. Concretely, during the search process, we block-wisely select the best subnet to avoid the unfairness brought by the global search strategy used in previous works. When retraining the searched architecture, we adopt a dynamic joint loss to maintain the consistency between supernet training and subnet retraining, which also provides informative objectives for each block and shortens the paths of gradient propagation. We demonstrate this joint loss can effectively improve model performance. We also prove the necessity of the dynamic adjustment of this loss. The experiments show that our method can achieve significant computational reduction, especially on latent diffusion models with about 50% MACs and Parameter reduction.
    Metropolis Sampling for Constrained Diffusion Models. (arXiv:2307.05439v2 [cs.LG] UPDATED)
    Denoising diffusion models have recently emerged as the predominant paradigm for generative modelling on image domains. In addition, their extension to Riemannian manifolds has facilitated a range of applications across the natural sciences. While many of these problems stand to benefit from the ability to specify arbitrary, domain-informed constraints, this setting is not covered by the existing (Riemannian) diffusion model methodology. Recent work has attempted to address this issue by constructing novel noising processes based on the reflected Brownian motion and logarithmic barrier methods. However, the associated samplers are either computationally burdensome or only apply to convex subsets of Euclidean space. In this paper, we introduce an alternative, simple noising scheme based on Metropolis sampling that affords substantial gains in computational efficiency and empirical performance compared to the earlier samplers. Of independent interest, we prove that this new process corresponds to a valid discretisation of the reflected Brownian motion. We demonstrate the scalability and flexibility of our approach on a range of problem settings with convex and non-convex constraints, including applications from geospatial modelling, robotics and protein design.  ( 2 min )
    A Coefficient Makes SVRG Effective. (arXiv:2311.05589v1 [cs.LG])
    Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlights, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our analysis finds that, for deeper networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient $\alpha$ to control the strength and adjust it through a linear decay schedule. We name our method $\alpha$-SVRG. Our results show $\alpha$-SVRG better optimizes neural networks, consistently reducing training loss compared to both baseline and the standard SVRG across various architectures and image classification datasets. We hope our findings encourage further exploration into variance reduction techniques in deep learning. Code is available at https://github.com/davidyyd/alpha-SVRG.
    A theory of data variability in Neural Network Bayesian inference. (arXiv:2307.16695v2 [cond-mat.dis-nn] UPDATED)
    Bayesian inference and kernel methods are well established in machine learning. The neural network Gaussian process in particular provides a concept to investigate neural networks in the limit of infinitely wide hidden layers by using kernel and inference methods. Here we build upon this limit and provide a field-theoretic formalism which covers the generalization properties of infinitely wide networks. We systematically compute generalization properties of linear, non-linear, and deep non-linear networks for kernel matrices with heterogeneous entries. In contrast to currently employed spectral methods we derive the generalization properties from the statistical properties of the input, elucidating the interplay of input dimensionality, size of the training data set, and variability of the data. We show that data variability leads to a non-Gaussian action reminiscent of a ($\varphi^3+\varphi^4$)-theory. Using our formalism on a synthetic task and on MNIST we obtain a homogeneous kernel matrix approximation for the learning curve as well as corrections due to data variability which allow the estimation of the generalization properties and exact results for the bounds of the learning curves in the case of infinitely many training data points.
    SIRE: scale-invariant, rotation-equivariant estimation of artery orientations using graph neural networks. (arXiv:2311.05400v1 [cs.CV])
    Blood vessel orientation as visualized in 3D medical images is an important descriptor of its geometry that can be used for centerline extraction and subsequent segmentation and visualization. Arteries appear at many scales and levels of tortuosity, and determining their exact orientation is challenging. Recent works have used 3D convolutional neural networks (CNNs) for this purpose, but CNNs are sensitive to varying vessel sizes and orientations. We present SIRE: a scale-invariant, rotation-equivariant estimator for local vessel orientation. SIRE is modular and can generalise due to symmetry preservation. SIRE consists of a gauge equivariant mesh CNN (GEM-CNN) operating on multiple nested spherical meshes with different sizes in parallel. The features on each mesh are a projection of image intensities within the corresponding sphere. These features are intrinsic to the sphere and, in combination with the GEM-CNN, lead to SO(3)-equivariance. Approximate scale invariance is achieved by weight sharing and use of a symmetric maximum function to combine multi-scale predictions. Hence, SIRE can be trained with arbitrarily oriented vessels with varying radii to generalise to vessels with a wide range of calibres and tortuosity. We demonstrate the efficacy of SIRE using three datasets containing vessels of varying scales: the vascular model repository (VMR), the ASOCA coronary artery set, and a set of abdominal aortic aneurysms (AAAs). We embed SIRE in a centerline tracker which accurately tracks AAAs, regardless of the data SIRE is trained with. Moreover, SIRE can be used to track coronary arteries, even when trained only with AAAs. In conclusion, by incorporating SO(3) and scale symmetries, SIRE can determine the orientations of vessels outside of the training domain, forming a robust and data-efficient solution to geometric analysis of blood vessels in 3D medical images.
    Transformer-based Model for Oral Epithelial Dysplasia Segmentation. (arXiv:2311.05452v1 [eess.IV])
    Oral epithelial dysplasia (OED) is a premalignant histopathological diagnosis given to lesions of the oral cavity. OED grading is subject to large inter/intra-rater variability, resulting in the under/over-treatment of patients. We developed a new Transformer-based pipeline to improve detection and segmentation of OED in haematoxylin and eosin (H&E) stained whole slide images (WSIs). Our model was trained on OED cases (n = 260) and controls (n = 105) collected using three different scanners, and validated on test data from three external centres in the United Kingdom and Brazil (n = 78). Our internal experiments yield a mean F1-score of 0.81 for OED segmentation, which reduced slightly to 0.71 on external testing, showing good generalisability, and gaining state-of-the-art results. This is the first externally validated study to use Transformers for segmentation in precancerous histology images. Our publicly available model shows great promise to be the first step of a fully-integrated pipeline, allowing earlier and more efficient OED diagnosis, ultimately benefiting patient outcomes.  ( 2 min )
    Edge-assisted U-Shaped Split Federated Learning with Privacy-preserving for Internet of Things. (arXiv:2311.04944v1 [cs.LG])
    In the realm of the Internet of Things (IoT), deploying deep learning models to process data generated or collected by IoT devices is a critical challenge. However, direct data transmission can cause network congestion and inefficient execution, given that IoT devices typically lack computation and communication capabilities. Centralized data processing in data centers is also no longer feasible due to concerns over data privacy and security. To address these challenges, we present an innovative Edge-assisted U-Shaped Split Federated Learning (EUSFL) framework, which harnesses the high-performance capabilities of edge servers to assist IoT devices in model training and optimization process. In this framework, we leverage Federated Learning (FL) to enable data holders to collaboratively train models without sharing their data, thereby enhancing data privacy protection by transmitting only model parameters. Additionally, inspired by Split Learning (SL), we split the neural network into three parts using U-shaped splitting for local training on IoT devices. By exploiting the greater computation capability of edge servers, our framework effectively reduces overall training time and allows IoT devices with varying capabilities to perform training tasks efficiently. Furthermore, we proposed a novel noise mechanism called LabelDP to ensure that data features and labels can securely resist reconstruction attacks, eliminating the risk of privacy leakage. Our theoretical analysis and experimental results demonstrate that EUSFL can be integrated with various aggregation algorithms, maintaining good performance across different computing capabilities of IoT devices, and significantly reducing training time and local computation overhead.  ( 3 min )
    Optimization on Manifolds via Graph Gaussian Processes. (arXiv:2210.10962v3 [stat.ML] UPDATED)
    This paper integrates manifold learning techniques within a \emph{Gaussian process upper confidence bound} algorithm to optimize an objective function on a manifold. Our approach is motivated by applications where a full representation of the manifold is not available and querying the objective is expensive. We rely on a point cloud of manifold samples to define a graph Gaussian process surrogate model for the objective. Query points are sequentially chosen using the posterior distribution of the surrogate model given all previous queries. We establish regret bounds in terms of the number of queries and the size of the point cloud. Several numerical examples complement the theory and illustrate the performance of our method.  ( 2 min )
    Non-Convex Bilevel Optimization with Time-Varying Objective Functions. (arXiv:2308.03811v2 [math.OC] UPDATED)
    Bilevel optimization has become a powerful tool in a wide variety of machine learning problems. However, the current nonconvex bilevel optimization considers an offline dataset and static functions, which may not work well in emerging online applications with streaming data and time-varying functions. In this work, we study online bilevel optimization (OBO) where the functions can be time-varying and the agent continuously updates the decisions with online streaming data. To deal with the function variations and the unavailability of the true hypergradients in OBO, we propose a single-loop online bilevel optimizer with window averaging (SOBOW), which updates the outer-level decision based on a window average of the most recent hypergradient estimations stored in the memory. Compared to existing algorithms, SOBOW is computationally efficient and does not need to know previous functions. To handle the unique technical difficulties rooted in single-loop update and function variations for OBO, we develop a novel analytical technique that disentangles the complex couplings between decision variables, and carefully controls the hypergradient estimation error. We show that SOBOW can achieve a sublinear bilevel local regret under mild conditions. Extensive experiments across multiple domains corroborate the effectiveness of SOBOW.  ( 2 min )
    Efficient Parallelization Layouts for Large-Scale Distributed Model Training. (arXiv:2311.05610v1 [cs.LG])
    Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a micro-batch size of 1 usually enables the most efficient training layouts. Larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism and also lead to larger pipeline bubbles. Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes, most notably a Model FLOPs utilization of 70.5% when training a 13B model.
    Taxonomy for Resident Space Objects in LEO: A Deep Learning Approach. (arXiv:2311.05430v1 [cs.LG])
    The increasing number of RSOs has raised concerns about the risk of collisions and catastrophic incidents for all direct and indirect users of space. To mitigate this issue, it is essential to have a good understanding of the various RSOs in orbit and their behaviour. A well-established taxonomy defining several classes of RSOs is a critical step in achieving this understanding. This taxonomy helps assign objects to specific categories based on their main characteristics, leading to better tracking services. Furthermore, a well-established taxonomy can facilitate research and analysis processes by providing a common language and framework for better understanding the factors that influence RSO behaviour in space. These factors, in turn, help design more efficient and effective strategies for space traffic management. Our work proposes a new taxonomy for RSOs focusing on the low Earth orbit regime to enhance space traffic management. In addition, we present a deep learning-based model that uses an autoencoder architecture to reduce the features representing the characteristics of the RSOs. The autoencoder generates a lower-dimensional space representation that is then explored using techniques such as Uniform Manifold Approximation and Projection to identify fundamental clusters of RSOs based on their unique characteristics. This approach captures the complex and non-linear relationships between the features and the RSOs' classes identified. Our proposed taxonomy and model offer a significant contribution to the ongoing efforts to mitigate the overall risks posed by the increasing number of RSOs in orbit.  ( 2 min )
    Low-Resource Named Entity Recognition: Can One-vs-All AUC Maximization Help?. (arXiv:2311.04918v1 [cs.CL])
    Named entity recognition (NER), a task that identifies and categorizes named entities such as persons or organizations from text, is traditionally framed as a multi-class classification problem. However, this approach often overlooks the issues of imbalanced label distributions, particularly in low-resource settings, which is common in certain NER contexts, like biomedical NER (bioNER). To address these issues, we propose an innovative reformulation of the multi-class problem as a one-vs-all (OVA) learning problem and introduce a loss function based on the area under the receiver operating characteristic curve (AUC). To enhance the efficiency of our OVA-based approach, we propose two training strategies: one groups labels with similar linguistic characteristics, and another employs meta-learning. The superiority of our approach is confirmed by its performance, which surpasses traditional NER learning in varying NER settings.
    Meta-learning of semi-supervised learning from tasks with heterogeneous attribute spaces. (arXiv:2311.05088v1 [cs.LG])
    We propose a meta-learning method for semi-supervised learning that learns from multiple tasks with heterogeneous attribute spaces. The existing semi-supervised meta-learning methods assume that all tasks share the same attribute space, which prevents us from learning with a wide variety of tasks. With the proposed method, the expected test performance on tasks with a small amount of labeled data is improved with unlabeled data as well as data in various tasks, where the attribute spaces are different among tasks. The proposed method embeds labeled and unlabeled data simultaneously in a task-specific space using a neural network, and the unlabeled data's labels are estimated by adapting classification or regression models in the embedding space. For the neural network, we develop variable-feature self-attention layers, which enable us to find embeddings of data with different attribute spaces with a single neural network by considering interactions among examples, attributes, and labels. Our experiments on classification and regression datasets with heterogeneous attribute spaces demonstrate that our proposed method outperforms the existing meta-learning and semi-supervised learning methods.
    Mixture of Weak & Strong Experts on Graphs. (arXiv:2311.05185v1 [cs.LG])
    Realistic graphs contain both rich self-features of nodes and informative structures of neighborhoods, jointly handled by a GNN in the typical setup. We propose to decouple the two modalities by mixture of weak and strong experts (Mowst), where the weak expert is a light-weight Multi-layer Perceptron (MLP), and the strong expert is an off-the-shelf Graph Neural Network (GNN). To adapt the experts' collaboration to different target nodes, we propose a "confidence" mechanism based on the dispersion of the weak expert's prediction logits. The strong expert is conditionally activated when either the node's classification relies on neighborhood information, or the weak expert has low model quality. We reveal interesting training dynamics by analyzing the influence of the confidence function on loss: our training algorithm encourages the specialization of each expert by effectively generating soft splitting of the graph. In addition, our "confidence" design imposes a desirable bias toward the strong expert to benefit from GNN's better generalization capability. Mowst is easy to optimize and achieves strong expressive power, with a computation cost comparable to a single GNN. Empirically, Mowst shows significant accuracy improvement on 6 standard node classification benchmarks (including both homophilous and heterophilous graphs).
    A*Net: A Scalable Path-based Reasoning Approach for Knowledge Graphs. (arXiv:2206.04798v5 [cs.AI] UPDATED)
    Reasoning on large-scale knowledge graphs has been long dominated by embedding methods. While path-based methods possess the inductive capacity that embeddings lack, their scalability is limited by the exponential number of paths. Here we present A*Net, a scalable path-based method for knowledge graph reasoning. Inspired by the A* algorithm for shortest path problems, our A*Net learns a priority function to select important nodes and edges at each iteration, to reduce time and memory footprint for both training and inference. The ratio of selected nodes and edges can be specified to trade off between performance and efficiency. Experiments on both transductive and inductive knowledge graph reasoning benchmarks show that A*Net achieves competitive performance with existing state-of-the-art path-based methods, while merely visiting 10% nodes and 10% edges at each iteration. On a million-scale dataset ogbl-wikikg2, A*Net not only achieves a new state-of-the-art result, but also converges faster than embedding methods. A*Net is the first path-based method for knowledge graph reasoning at such scale.
    Variational Denoising for Variational Quantum Eigensolver. (arXiv:2304.00549v2 [quant-ph] UPDATED)
    The variational quantum eigensolver (VQE) is a hybrid algorithm that has the potential to provide a quantum advantage in practical chemistry problems that are currently intractable on classical computers. VQE trains parameterized quantum circuits using a classical optimizer to approximate the eigenvalues and eigenstates of a given Hamiltonian. However, VQE faces challenges in task-specific design and machine-specific architecture, particularly when running on noisy quantum devices. This can have a negative impact on its trainability, accuracy, and efficiency, resulting in noisy quantum data. We propose variational denoising, an unsupervised learning method that employs a parameterized quantum neural network to improve the solution of VQE by learning from noisy VQE outputs. Our approach can significantly decrease energy estimation errors and increase fidelities with ground states compared to noisy input data for the $\text{H}_2$, LiH, and $\text{BeH}_2$ molecular Hamiltonians, and the transverse field Ising model. Surprisingly, it only requires noisy data for training. Variational denoising can be integrated into quantum hardware, increasing its versatility as an end-to-end quantum processing for quantum data.
    Data Distillation for Neural Network Potentials toward Foundational Dataset. (arXiv:2311.05407v1 [physics.comp-ph])
    Machine learning (ML) techniques and atomistic modeling have rapidly transformed materials design and discovery. Specifically, generative models can swiftly propose promising materials for targeted applications. However, the predicted properties of materials through the generative models often do not match with calculated properties through ab initio calculations. This discrepancy can arise because the generated coordinates are not fully relaxed, whereas the many properties are derived from relaxed structures. Neural network-based potentials (NNPs) can expedite the process by providing relaxed structures from the initially generated ones. Nevertheless, acquiring data to train NNPs for this purpose can be extremely challenging as it needs to encompass previously unknown structures. This study utilized extended ensemble molecular dynamics (MD) to secure a broad range of liquid- and solid-phase configurations in one of the metallic systems, nickel. Then, we could significantly reduce them through active learning without losing much accuracy. We found that the NNP trained from the distilled data could predict different energy-minimized closed-pack crystal structures even though those structures were not explicitly part of the initial data. Furthermore, the data can be translated to other metallic systems (aluminum and niobium), without repeating the sampling and distillation processes. Our approach to data acquisition and distillation has demonstrated the potential to expedite NNP development and enhance materials design and discovery by integrating generative models.  ( 2 min )
    RepQ: Generalizing Quantization-Aware Training for Re-Parametrized Architectures. (arXiv:2311.05317v1 [cs.LG])
    Existing neural networks are memory-consuming and computationally intensive, making deploying them challenging in resource-constrained environments. However, there are various methods to improve their efficiency. Two such methods are quantization, a well-known approach for network compression, and re-parametrization, an emerging technique designed to improve model performance. Although both techniques have been studied individually, there has been limited research on their simultaneous application. To address this gap, we propose a novel approach called RepQ, which applies quantization to re-parametrized networks. Our method is based on the insight that the test stage weights of an arbitrary re-parametrized layer can be presented as a differentiable function of trainable parameters. We enable quantization-aware training by applying quantization on top of this function. RepQ generalizes well to various re-parametrized models and outperforms the baseline method LSQ quantization scheme in all experiments.  ( 2 min )
    Data Valuation and Detections in Federated Learning. (arXiv:2311.05304v1 [cs.LG])
    Federated Learning (FL) enables collaborative model training without sharing raw data, demanding abundant, high-quality data for optimal model performance. Fair and efficient data evaluation is a fundamental issue for incentivizing clients to provide more high-quality data. Meanwhile, it is likely that only a subset of clients and datasets are relevant for a learning task while the rest of them may have a negative impact on the model training. This paper introduces a novel privacy-preserving method for evaluating client contributions and selecting relevant data samples without a pre-specified training algorithm. Our proposed approach, FedBary, utilizes Wasserstein distance within the federated context, offering a new pioneering solution for data valuation, which provides transparent data evaluation and efficient computation of Wasserstein barycenter to mitigate reliance on validation data. We conduct extensive empirical experiments and theoretical analysis, showing the promising research of this valuation metric.
    Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset. (arXiv:2306.11167v4 [cs.CL] UPDATED)
    The quest for human imitative AI has been an enduring topic in AI research since its inception. The technical evolution and emerging capabilities of the latest cohort of large language models (LLMs) have reinvigorated the subject beyond academia to the cultural zeitgeist. While recent NLP evaluation benchmark tasks test some aspects of human-imitative behaviour (e.g., BIG-bench's 'human-like behavior' tasks), few, if not none, examine creative problem solving abilities. Creative problem solving in humans is a well-studied topic in cognitive neuroscience with standardized tests that predominantly use the ability to associate (heterogeneous) connections among clue words as a metric for creativity. Exposure to misleading stimuli - distractors dubbed red herrings - impede human performance in such tasks via the fixation effect and Einstellung paradigm. In cognitive neuroscience studies, such fixations are experimentally induced by pre-exposing participants to orthographically similar incorrect words to subsequent word-fragments or clues. The popular British quiz show Only Connect's Connecting Wall segment essentially mimics Mednick's Remote Associates Test (RAT) formulation with built-in, deliberate red herrings, which makes it an ideal proxy dataset to explore and study fixation effect and Einstellung paradigm from cognitive neuroscience in LLMs. In this paper we present the novel Only Connect Wall (OCW) dataset and report results from our evaluation of selected pre-trained language models and LLMs on creative problem solving tasks like grouping clue words by heterogeneous connections, and identifying correct open knowledge domain connections in respective groups. We synthetically generate two additional datasets: OCW-Randomized, OCW-WordNet to further analyze our red-herrings hypothesis in language models. The code and link to the dataset are available at https://github.com/TaatiTeam/OCW.
    Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO. (arXiv:2311.04951v1 [cs.LG])
    Inference optimizations are critical for improving user experience and reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling to reduce the overall latency of text generation and compare it with standard autoregressive sampling. This can be used together with model-based optimizations (e.g. quantization) to provide an optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.
    On neural and dimensional collapse in supervised and unsupervised contrastive learning with hard negative sampling. (arXiv:2311.05139v1 [cs.LG])
    For a widely-studied data model and general loss and sample-hardening functions we prove that the Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) risks are minimized by representations that exhibit Neural Collapse (NC), i.e., the class means form an Equianglular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) risks are lower bounded by the corresponding SCL and UCL risks. Although the optimality of ETF is known for SCL, albeit only for InfoNCE loss, its optimality for HSCL and UCL under general loss and hardening functions is novel. Moreover, our proofs are much simpler, compact, and transparent. We empirically demonstrate, for the first time, that ADAM optimization of HSCL and HUCL risks with random initialization and suitable hardness levels can indeed converge to the NC geometry if we incorporate unit-ball or unit-sphere feature normalization. Without incorporating hard negatives or feature normalization, however, the representations learned via ADAM suffer from dimensional collapse (DC) and fail to attain the NC geometry.  ( 2 min )
    GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition. (arXiv:2311.04996v1 [eess.AS])
    While Connectionist Temporal Classification (CTC) models deliver state-of-the-art accuracy in automated speech recognition (ASR) pipelines, their performance has been limited by CPU-based beam search decoding. We introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam search decoder compatible with current CTC models. It increases pipeline throughput and decreases latency, supports streaming inference, and also supports advanced features like utterance-specific word boosting via on-the-fly composition. We provide pre-built DLPack-based python bindings for ease of use with Python-based machine learning frameworks at https://github.com/nvidia-riva/riva-asrlib-decoder. We evaluated our decoder for offline and online scenarios, demonstrating that it is the fastest beam search decoder for CTC models. In the offline scenario it achieves up to 7 times more throughput than the current state-of-the-art CPU decoder and in the online streaming scenario, it achieves nearly 8 times lower latency, with same or better word error rate.  ( 2 min )
    Signal Temporal Logic-Guided Apprenticeship Learning. (arXiv:2311.05084v1 [cs.RO])
    Apprenticeship learning crucially depends on effectively learning rewards, and hence control policies from user demonstrations. Of particular difficulty is the setting where the desired task consists of a number of sub-goals with temporal dependencies. The quality of inferred rewards and hence policies are typically limited by the quality of demonstrations, and poor inference of these can lead to undesirable outcomes. In this letter, we show how temporal logic specifications that describe high level task objectives, are encoded in a graph to define a temporal-based metric that reasons about behaviors of demonstrators and the learner agent to improve the quality of inferred rewards and policies. Through experiments on a diverse set of robot manipulator simulations, we show how our framework overcomes the drawbacks of prior literature by drastically improving the number of demonstrations required to learn a control policy.  ( 2 min )
    Sparse PCA Beyond Covariance Thresholding. (arXiv:2302.10158v2 [cs.LG] UPDATED)
    In the Wishart model for sparse PCA we are given $n$ samples $Y_1,\ldots, Y_n$ drawn independently from a $d$-dimensional Gaussian distribution $N({0, Id + \beta vv^\top})$, where $\beta > 0$ and $v\in \mathbb{R}^d$ is a $k$-sparse unit vector, and we wish to recover $v$ (up to sign). We show that if $n \ge \Omega(d)$, then for every $t \ll k$ there exists an algorithm running in time $n\cdot d^{O(t)}$ that solves this problem as long as \[ \beta \gtrsim \frac{k}{\sqrt{nt}}\sqrt{\ln({2 + td/k^2})}\,. \] Prior to this work, the best polynomial time algorithm in the regime $k\approx \sqrt{d}$, called \emph{Covariance Thresholding} (proposed in [KNV15a] and analyzed in [DM14]), required $\beta \gtrsim \frac{k}{\sqrt{n}}\sqrt{\ln({2 + d/k^2})}$. For large enough constant $t$ our algorithm runs in polynomial time and has better guarantees than Covariance Thresholding. Previously known algorithms with such guarantees required quasi-polynomial time $d^{O(\log d)}$. In addition, we show that our techniques work with sparse PCA with adversarial perturbations studied in [dKNS20]. This model generalizes not only sparse PCA, but also other problems studied in prior works, including the sparse planted vector problem. As a consequence, we provide polynomial time algorithms for the sparse planted vector problem that have better guarantees than the state of the art in some regimes. Our approach also works with the Wigner model for sparse PCA. Moreover, we show that it is possible to combine our techniques with recent results on sparse PCA with symmetric heavy-tailed noise [dNNS22]. In particular, in the regime $k \approx \sqrt{d}$ we get the first polynomial time algorithm that works with symmetric heavy-tailed noise, while the algorithm from [dNNS22]. requires quasi-polynomial time in these settings.  ( 3 min )
    ABIGX: A Unified Framework for eXplainable Fault Detection and Classification. (arXiv:2311.05316v1 [cs.LG])
    For explainable fault detection and classification (FDC), this paper proposes a unified framework, ABIGX (Adversarial fault reconstruction-Based Integrated Gradient eXplanation). ABIGX is derived from the essentials of previous successful fault diagnosis methods, contribution plots (CP) and reconstruction-based contribution (RBC). It is the first explanation framework that provides variable contributions for the general FDC models. The core part of ABIGX is the adversarial fault reconstruction (AFR) method, which rethinks the FR from the perspective of adversarial attack and generalizes to fault classification models with a new fault index. For fault classification, we put forward a new problem of fault class smearing, which intrinsically hinders the correct explanation. We prove that ABIGX effectively mitigates this problem and outperforms the existing gradient-based explanation methods. For fault detection, we theoretically bridge ABIGX with conventional fault diagnosis methods by proving that CP and RBC are the linear specifications of ABIGX. The experiments evaluate the explanations of FDC by quantitative metrics and intuitive illustrations, the results of which show the general superiority of ABIGX to other advanced explanation methods.  ( 2 min )
    Using ResNet to Utilize 4-class T2-FLAIR Slice Classification Based on the Cholinergic Pathways Hyperintensities Scale for Pathological Aging. (arXiv:2311.05477v1 [eess.IV])
    The Cholinergic Pathways Hyperintensities Scale (CHIPS) is a visual rating scale used to assess the extent of cholinergic white matter hyperintensities in T2-FLAIR images, serving as an indicator of dementia severity. However, the manual selection of four specific slices for rating throughout the entire brain is a time-consuming process. Our goal was to develop a deep learning-based model capable of automatically identifying the four slices relevant to CHIPS. To achieve this, we trained a 4-class slice classification model (BSCA) using the ADNI T2-FLAIR dataset (N=150) with the assistance of ResNet. Subsequently, we tested the model's performance on a local dataset (N=30). The results demonstrated the efficacy of our model, with an accuracy of 99.82% and an F1-score of 99.83%. This achievement highlights the potential impact of BSCA as an automatic screening tool, streamlining the selection of four specific T2-FLAIR slices that encompass white matter landmarks along the cholinergic pathways. Clinicians can leverage this tool to assess the risk of clinical dementia development efficiently.
    Basis functions nonlinear data-enabled predictive control: Consistent and computationally efficient formulations. (arXiv:2311.05360v1 [eess.SY])
    This paper considers the extension of data-enabled predictive control (DeePC) to nonlinear systems via general basis functions. Firstly, we formulate a basis functions DeePC behavioral predictor and we identify necessary and sufficient conditions for equivalence with a corresponding basis functions multi-step identified predictor. The derived conditions yield a dynamic regularization cost function that enables a well-posed (i.e., consistent) basis functions formulation of nonlinear DeePC. To optimize computational efficiency of basis functions DeePC we further develop two alternative formulations that use a simpler, sparse regularization cost function and ridge regression, respectively. Consistency implications for Koopman DeePC as well as several methods for constructing the basis functions representation are also indicated. The effectiveness of the developed consistent basis functions DeePC formulations is illustrated on a benchmark nonlinear pendulum state-space model, for both noise free and noisy data.
    Contextualizing the Limits of Model & Evaluation Dataset Curation on Semantic Similarity Classification Tasks. (arXiv:2311.04927v1 [cs.CL])
    This paper demonstrates how the limitations of pre-trained models and open evaluation datasets factor into assessing the performance of binary semantic similarity classification tasks. As (1) end-user-facing documentation around the curation of these datasets and pre-trained model training regimes is often not easily accessible and (2) given the lower friction and higher demand to quickly deploy such systems in real-world contexts, our study reinforces prior work showing performance disparities across datasets, embedding techniques and distance metrics, while highlighting the importance of understanding how data is collected, curated and analyzed in semantic similarity classification.
    All Should Be Equal in the Eyes of Language Models: Counterfactually Aware Fair Text Generation. (arXiv:2311.05451v1 [cs.CL])
    Fairness in Language Models (LMs) remains a longstanding challenge, given the inherent biases in training data that can be perpetuated by models and affect the downstream tasks. Recent methods employ expensive retraining or attempt debiasing during inference by constraining model outputs to contrast from a reference set of biased templates or exemplars. Regardless, they dont address the primary goal of fairness to maintain equitability across different demographic groups. In this work, we posit that inferencing LMs to generate unbiased output for one demographic under a context ensues from being aware of outputs for other demographics under the same context. To this end, we propose Counterfactually Aware Fair InferencE (CAFIE), a framework that dynamically compares the model understanding of diverse demographics to generate more equitable sentences. We conduct an extensive empirical evaluation using base LMs of varying sizes and across three diverse datasets and found that CAFIE outperforms strong baselines. CAFIE produces fairer text and strikes the best balance between fairness and language modeling capability
    Quantum Generative Modeling of Sequential Data with Trainable Token Embedding. (arXiv:2311.05050v1 [cs.LG])
    Generative models are a class of machine learning models that aim to learn the underlying probability distribution of data. Unlike discriminative models, generative models focus on capturing the data's inherent structure, allowing them to generate new samples that resemble the original data. To fully exploit the potential of modeling probability distributions using quantum physics, a quantum-inspired generative model known as the Born machines have shown great advancements in learning classical and quantum data over matrix product state(MPS) framework. The Born machines support tractable log-likelihood, autoregressive and mask sampling, and have shown outstanding performance in various unsupervised learning tasks. However, much of the current research has been centered on improving the expressive power of MPS, predominantly embedding each token directly by a corresponding tensor index. In this study, we generalize the embedding method into trainable quantum measurement operators that can be simultaneously honed with MPS. Our study indicated that combined with trainable embedding, Born machines can exhibit better performance and learn deeper correlations from the dataset.
    Latent Task-Specific Graph Network Simulators. (arXiv:2311.05256v1 [cs.LG])
    Simulating dynamic physical interactions is a critical challenge across multiple scientific domains, with applications ranging from robotics to material science. For mesh-based simulations, Graph Network Simulators (GNSs) pose an efficient alternative to traditional physics-based simulators. Their inherent differentiability and speed make them particularly well-suited for inverse design problems. Yet, adapting to new tasks from limited available data is an important aspect for real-world applications that current methods struggle with. We frame mesh-based simulation as a meta-learning problem and use a recent Bayesian meta-learning method to improve GNSs adaptability to new scenarios by leveraging context data and handling uncertainties. Our approach, latent task-specific graph network simulator, uses non-amortized task posterior approximations to sample latent descriptions of unknown system properties. Additionally, we leverage movement primitives for efficient full trajectory prediction, effectively addressing the issue of accumulating errors encountered by previous auto-regressive methods. We validate the effectiveness of our approach through various experiments, performing on par with or better than established baseline methods. Movement primitives further allow us to accommodate various types of context data, as demonstrated through the utilization of point clouds during inference. By combining GNSs with meta-learning, we bring them closer to real-world applicability, particularly in scenarios with smaller datasets.
    Outlier-Robust Wasserstein DRO. (arXiv:2311.05573v1 [stat.ML])
    Distributionally robust optimization (DRO) is an effective approach for data-driven decision-making in the presence of uncertainty. Geometric uncertainty due to sampling or localized perturbations of data points is captured by Wasserstein DRO (WDRO), which seeks to learn a model that performs uniformly well over a Wasserstein ball centered around the observed data distribution. However, WDRO fails to account for non-geometric perturbations such as adversarial outliers, which can greatly distort the Wasserstein distance measurement and impede the learned model. We address this gap by proposing a novel outlier-robust WDRO framework for decision-making under both geometric (Wasserstein) perturbations and non-geometric (total variation (TV)) contamination that allows an $\varepsilon$-fraction of data to be arbitrarily corrupted. We design an uncertainty set using a certain robust Wasserstein ball that accounts for both perturbation types and derive minimax optimal excess risk bounds for this procedure that explicitly capture the Wasserstein and TV risks. We prove a strong duality result that enables tractable convex reformulations and efficient computation of our outlier-robust WDRO problem. When the loss function depends only on low-dimensional features of the data, we eliminate certain dimension dependencies from the risk bounds that are unavoidable in the general setting. Finally, we present experiments validating our theory on standard regression and classification tasks.
    Predicting the Position Uncertainty at the Time of Closest Approach with Diffusion Models. (arXiv:2311.05417v1 [cs.LG])
    The risk of collision between resident space objects has significantly increased in recent years. As a result, spacecraft collision avoidance procedures have become an essential part of satellite operations. To ensure safe and effective space activities, satellite owners and operators rely on constantly updated estimates of encounters. These estimates include the uncertainty associated with the position of each object at the expected TCA. These estimates are crucial in planning risk mitigation measures, such as collision avoidance manoeuvres. As the TCA approaches, the accuracy of these estimates improves, as both objects' orbit determination and propagation procedures are made for increasingly shorter time intervals. However, this improvement comes at the cost of taking place close to the critical decision moment. This means that safe avoidance manoeuvres might not be possible or could incur significant costs. Therefore, knowing the evolution of this variable in advance can be crucial for operators. This work proposes a machine learning model based on diffusion models to forecast the position uncertainty of objects involved in a close encounter, particularly for the secondary object (usually debris), which tends to be more unpredictable. We compare the performance of our model with other state-of-the-art solutions and a na\"ive baseline approach, showing that the proposed solution has the potential to significantly improve the safety and effectiveness of spacecraft operations.
    Perfecting Liquid-State Theories with Machine Intelligence. (arXiv:2311.05167v1 [physics.chem-ph])
    Recent years have seen a significant increase in the use of machine intelligence for predicting electronic structure, molecular force fields, and the physicochemical properties of various condensed systems. However, substantial challenges remain in developing a comprehensive framework capable of handling a wide range of atomic compositions and thermodynamic conditions. This perspective discusses potential future developments in liquid-state theories leveraging on recent advancements of functional machine learning. By harnessing the strengths of theoretical analysis and machine learning techniques including surrogate models, dimension reduction and uncertainty quantification, we envision that liquid-state theories will gain significant improvements in accuracy, scalability and computational efficiency, enabling their broader applications across diverse materials and chemical systems.
    Do Ensembling and Meta-Learning Improve Outlier Detection in Randomized Controlled Trials?. (arXiv:2311.05473v1 [cs.LG])
    Modern multi-centre randomized controlled trials (MCRCTs) collect massive amounts of tabular data, and are monitored intensively for irregularities by humans. We began by empirically evaluating 6 modern machine learning-based outlier detection algorithms on the task of identifying irregular data in 838 datasets from 7 real-world MCRCTs with a total of 77,001 patients from over 44 countries. Our results reinforce key findings from prior work in the outlier detection literature on data from other domains. Existing algorithms often succeed at identifying irregularities without any supervision, with at least one algorithm exhibiting positive performance 70.6% of the time. However, performance across datasets varies substantially with no single algorithm performing consistently well, motivating new techniques for unsupervised model selection or other means of aggregating potentially discordant predictions from multiple candidate models. We propose the Meta-learned Probabilistic Ensemble (MePE), a simple algorithm for aggregating the predictions of multiple unsupervised models, and show that it performs favourably compared to recent meta-learning approaches for outlier detection model selection. While meta-learning shows promise, small ensembles outperform all forms of meta-learning on average, a negative result that may guide the application of current outlier detection approaches in healthcare and other real-world domains.
    LLM Augmented Hierarchical Agents. (arXiv:2311.05596v1 [cs.LG])
    Solving long-horizon, temporally-extended tasks using Reinforcement Learning (RL) is challenging, compounded by the common practice of learning without prior knowledge (or tabula rasa learning). Humans can generate and execute plans with temporally-extended actions and quickly learn to perform new tasks because we almost never solve problems from scratch. We want autonomous agents to have this same ability. Recently, LLMs have been shown to encode a tremendous amount of knowledge about the world and to perform impressive in-context learning and reasoning. However, using LLMs to solve real world problems is hard because they are not grounded in the current task. In this paper we exploit the planning capabilities of LLMs while using RL to provide learning from the environment, resulting in a hierarchical agent that uses LLMs to solve long-horizon tasks. Instead of completely relying on LLMs, they guide a high-level policy, making learning significantly more sample efficient. This approach is evaluated in simulation environments such as MiniGrid, SkillHack, and Crafter, and on a real robot arm in block manipulation tasks. We show that agents trained using our approach outperform other baselines methods and, once trained, don't need access to LLMs during deployment.
    Towards End-to-End Spoken Grammatical Error Correction. (arXiv:2311.05550v1 [cs.CL])
    Grammatical feedback is crucial for L2 learners, teachers, and testers. Spoken grammatical error correction (GEC) aims to supply feedback to L2 learners on their use of grammar when speaking. This process usually relies on a cascaded pipeline comprising an ASR system, disfluency removal, and GEC, with the associated concern of propagating errors between these individual modules. In this paper, we introduce an alternative "end-to-end" approach to spoken GEC, exploiting a speech recognition foundation model, Whisper. This foundation model can be used to replace the whole framework or part of it, e.g., ASR and disfluency removal. These end-to-end approaches are compared to more standard cascaded approaches on the data obtained from a free-speaking spoken language assessment test, Linguaskill. Results demonstrate that end-to-end spoken GEC is possible within this architecture, but the lack of available data limits current performance compared to a system using large quantities of text-based GEC data. Conversely, end-to-end disfluency detection and removal, which is easier for the attention-based Whisper to learn, does outperform cascaded approaches. Additionally, the paper discusses the challenges of providing feedback to candidates when using end-to-end systems for spoken GEC.
    Understanding Deep Gradient Leakage via Inversion Influence Functions. (arXiv:2309.13016v2 [cs.LG] UPDATED)
    Deep Gradient Leakage (DGL) is a highly effective attack that recovers private training images from gradient vectors. This attack casts significant privacy challenges on distributed learning from clients with sensitive data, where clients are required to share gradients. Defending against such attacks requires but lacks an understanding of when and how privacy leakage happens, mostly because of the black-box nature of deep networks. In this paper, we propose a novel Inversion Influence Function (I$^2$F) that establishes a closed-form connection between the recovered images and the private gradients by implicitly solving the DGL problem. Compared to directly solving DGL, I$^2$F is scalable for analyzing deep networks, requiring only oracle access to gradients and Jacobian-vector products. We empirically demonstrate that I$^2$F effectively approximated the DGL generally on different model architectures, datasets, attack implementations, and noise-based defenses. With this novel tool, we provide insights into effective gradient perturbation directions, the unfairness of privacy protection, and privacy-preferred model initialization. Our codes are provided in https://github.com/illidanlab/inversion-influence-function.
    Joint Sensing and Semantic Communications with Multi-Task Deep Learning. (arXiv:2311.05017v1 [cs.NI])
    This paper explores the integration of deep learning techniques for joint sensing and communications, with an extension to semantic communications. The integrated system comprises a transmitter and receiver operating over a wireless channel, subject to noise and fading effects. The transmitter employs a deep neural network, namely an encoder, for joint operations of source coding, channel coding, and modulation, while the receiver utilizes another deep neural network, namely a decoder, for joint operations of demodulation, channel decoding, and source decoding to reconstruct the data samples. The transmitted signal serves a dual purpose, supporting communication with the receiver and enabling sensing. When a target is present, the reflected signal is received, and another deep neural network decoder is utilized for sensing. This decoder is responsible for detecting the target's presence and determining its range. All these deep neural networks, including one encoder and two decoders, undergo joint training through multi-task learning, considering data and channel characteristics. This paper extends to incorporate semantic communications by introducing an additional deep neural network, another decoder at the receiver, operating as a task classifier. This decoder evaluates the fidelity of label classification for received signals, enhancing the integration of semantics within the communication process. The study presents results based on using the CIFAR-10 as the input data and accounting for channel effects like Additive White Gaussian Noise (AWGN) and Rayleigh fading. The results underscore the effectiveness of multi-task deep learning in achieving high-fidelity joint sensing and semantic communications.
    LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. (arXiv:2311.05437v1 [cs.CV])
    LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction sessions, significantly improving tool use performance and enabling new scenarios.
    A posteriori Trading-inspired Model-free Time Series Segmentation. (arXiv:1912.06708v2 [stat.ML] UPDATED)
    Within the context of multivariate time series segmentation this paper proposes a method inspired by a posteriori optimal trading. After a normalization step time series are treated channel-wise as surrogate stock prices that can be traded optimally a posteriori in a virtual portfolio holding either stock or cash. Linear transaction costs are interpreted as hyperparameters for noise filtering. Resulting trading signals as well as resulting trading signals obtained on the reversed time series are used for unsupervised labeling, before a consensus over channels is reached that determines segmentation time instants. The method is model-free such that no model prescriptions for segments are made. Benefits of proposed approach include simplicity, adaptability to a wide range of different shapes of time series, and in particular computational efficiency that make it suitable for big data. Performance is demonstrated on synthetic and real-world data, including a large-scale dataset comprising a multivariate time series of dimension 1000 and length 2709. Proposed method is compared to a popular model-based bottom-up approach fitting piecewise affine models and to a state-of-the-art model-based top-down approach fitting Gaussian models, and found to be consistently faster while producing more intuitive results.
    Effective Restoration of Source Knowledge in Continual Test Time Adaptation. (arXiv:2311.04991v1 [cs.LG])
    Traditional test-time adaptation (TTA) methods face significant challenges in adapting to dynamic environments characterized by continuously changing long-term target distributions. These challenges primarily stem from two factors: catastrophic forgetting of previously learned valuable source knowledge and gradual error accumulation caused by miscalibrated pseudo labels. To address these issues, this paper introduces an unsupervised domain change detection method that is capable of identifying domain shifts in dynamic environments and subsequently resets the model parameters to the original source pre-trained values. By restoring the knowledge from the source, it effectively corrects the negative consequences arising from the gradual deterioration of model parameters caused by ongoing shifts in the domain. Our method involves progressive estimation of global batch-norm statistics specific to each domain, while keeping track of changes in the statistics triggered by domain shifts. Importantly, our method is agnostic to the specific adaptation technique employed and thus, can be incorporated to existing TTA methods to enhance their performance in dynamic environments. We perform extensive experiments on benchmark datasets to demonstrate the superior performance of our method compared to state-of-the-art adaptation methods.
    When Meta-Learning Meets Online and Continual Learning: A Survey. (arXiv:2311.05241v1 [cs.LG])
    Over the past decade, deep neural networks have demonstrated significant success using the training scheme that involves mini-batch stochastic gradient descent on extensive datasets. Expanding upon this accomplishment, there has been a surge in research exploring the application of neural networks in other learning scenarios. One notable framework that has garnered significant attention is meta-learning. Often described as "learning to learn," meta-learning is a data-driven approach to optimize the learning algorithm. Other branches of interest are continual learning and online learning, both of which involve incrementally updating a model with streaming data. While these frameworks were initially developed independently, recent works have started investigating their combinations, proposing novel problem settings and learning algorithms. However, due to the elevated complexity and lack of unified terminology, discerning differences between the learning frameworks can be challenging even for experienced researchers. To facilitate a clear understanding, this paper provides a comprehensive survey that organizes various problem settings using consistent terminology and formal descriptions. By offering an overview of these learning paradigms, our work aims to foster further advancements in this promising area of research.
    Efficient Classification of SARS-CoV-2 Spike Sequences Using Federated Learning. (arXiv:2302.08688v2 [cs.LG] UPDATED)
    This paper presents a federated learning (FL) approach to train an AI model for SARS-Cov-2 variant classification. We analyze the SARS-CoV-2 spike sequences in a distributed way, without data sharing, to detect different variants of this rapidly mutating coronavirus. Our method maintains the confidentiality of local data (that could be stored in different locations) yet allows us to reliably detect and identify different known and unknown variants of the novel coronavirus SARS-CoV-2. Using the proposed approach, we achieve an overall accuracy of $93\%$ on the coronavirus variant identification task. We also provide details regarding how the proposed model follows the main laws of federated learning, such as Laws of data ownership, data privacy, model aggregation, and model heterogeneity. Since the proposed model is distributed, it could scale on ``Big Data'' easily. We plan to use this proof-of-concept to implement a privacy-preserving pandemic response strategy.
    Spectral Cross-Domain Neural Network with Soft-adaptive Threshold Spectral Enhancement. (arXiv:2301.10171v2 [cs.LG] UPDATED)
    Electrocardiography (ECG) signals can be considered as multi-variable time-series. The state-of-the-art ECG data classification approaches, based on either feature engineering or deep learning techniques, treat separately spectral and time domains in machine learning systems. No spectral-time domain communication mechanism inside the classifier model can be found in current approaches, leading to difficulties in identifying complex ECG forms. In this paper, we proposed a novel deep learning model named Spectral Cross-domain neural network (SCDNN) with a new block called Soft-adaptive threshold spectral enhancement (SATSE), to simultaneously reveal the key information embedded in spectral and time domains inside the neural network. More precisely, the domain-cross information is captured by a general Convolutional neural network (CNN) backbone, and different information sources are merged by a self-adaptive mechanism to mine the connection between time and spectral domains. In SATSE, the knowledge from time and spectral domains is extracted via the Fast Fourier Transformation (FFT) with soft trainable thresholds in modified Sigmoid functions. The proposed SCDNN is tested with several classification tasks implemented on the public ECG databases \textit{PTB-XL} and \textit{MIT-BIH}. SCDNN outperforms the state-of-the-art approaches with a low computational cost regarding a variety of metrics in all classification tasks on both databases, by finding appropriate domains from the infinite spectral mapping. The convergence of the trainable thresholds in the spectral domain is also numerically investigated in this paper. The robust performance of SCDNN provides a new perspective to exploit knowledge across deep learning models from time and spectral domains. The repository can be found: https://github.com/DL-WG/SCDNN-TS
    Comparing Sequential Forecasters. (arXiv:2110.00115v6 [stat.ME] UPDATED)
    Consider two forecasters, each making a single prediction for a sequence of events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts and outcomes were generated? In this paper, we present a rigorous answer to this question by designing novel sequential inference procedures for estimating the time-varying difference in forecast scores. To do this, we employ confidence sequences (CS), which are sequences of confidence intervals that can be continuously monitored and are valid at arbitrary data-dependent stopping times ("anytime-valid"). The widths of our CSs are adaptive to the underlying variance of the score differences. Underlying their construction is a game-theoretic statistical framework, in which we further identify e-processes and p-processes for sequentially testing a weak null hypothesis -- whether one forecaster outperforms another on average (rather than always). Our methods do not make distributional assumptions on the forecasts or outcomes; our main theorems apply to any bounded scores, and we later provide alternative methods for unbounded scores. We empirically validate our approaches by comparing real-world baseball and weather forecasters.
    Unpaired MRI Super Resolution with Self-Supervised Contrastive Learning. (arXiv:2310.15767v2 [eess.IV] UPDATED)
    High-resolution (HR) magnetic resonance imaging (MRI) is crucial for enhancing diagnostic accuracy in clinical settings. Nonetheless, the inherent limitation of MRI resolution restricts its widespread applicability. Deep learning-based image super-resolution (SR) methods exhibit promise in improving MRI resolution without additional cost. However, these methods frequently require a substantial number of HR MRI images for training, which can be challenging to acquire. In this paper, we propose an unpaired MRI SR approach that employs self-supervised contrastive learning to enhance SR performance with limited training data. Our approach leverages both authentic HR images and synthetically generated SR images to construct positive and negative sample pairs, thus facilitating the learning of discriminative features. Empirical results presented in this study underscore significant enhancements in the peak signal-to-noise ratio and structural similarity index, even when a paucity of HR images is available. These findings accentuate the potential of our approach in addressing the challenge of limited training data, thereby contributing to the advancement of high-resolution MRI in clinical applications.
    Logspace Reducibility From Secret Leakage Planted Clique. (arXiv:2107.11886v2 [cs.CC] UPDATED)
    The planted clique problem is well-studied in the context of observing, explaining, and predicting interesting computational phenomena associated with statistical problems. When equating computational efficiency with the existence of polynomial time algorithms, the computational hardness of (some variant of) the planted clique problem can be used to infer the computational hardness of a host of other statistical problems. Is this ability to transfer computational hardness from (some variant of) the planted clique problem to other statistical problems robust to changing our notion of computational efficiency to space efficiency? We answer this question affirmatively for three different statistical problems, namely Sparse PCA, submatrix detection, and testing almost k-wise independence. The key challenge is that space efficient randomized reductions need to repeatedly access the randomness they use. Known reductions to these problems are all randomized and need polynomially many random bits to implement. Since we can not store polynomially many random bits in memory, it is unclear how to implement these existing reductions space efficiently. There are two ideas involved in circumventing this issue and implementing known reductions to these problems space efficiently. 1. When solving statistical problems, we can use parts of the input itself as randomness. 2. Secret leakage variants of the planted clique problem with appropriate secret leakage can be more useful than the standard planted clique problem when we want to use parts of the input as randomness. (abstract shortened due to arxiv constraints)
    Simple steps are all you need: Frank-Wolfe and generalized self-concordant functions. (arXiv:2105.13913v7 [math.OC] UPDATED)
    Generalized self-concordance is a key property present in the objective function of many important learning problems. We establish the convergence rate of a simple Frank-Wolfe variant that uses the open-loop step size strategy $\gamma_t = 2/(t+2)$, obtaining a $\mathcal{O}(1/t)$ convergence rate for this class of functions in terms of primal gap and Frank-Wolfe gap, where $t$ is the iteration count. This avoids the use of second-order information or the need to estimate local smoothness parameters of previous work. We also show improved convergence rates for various common cases, e.g., when the feasible region under consideration is uniformly convex or polyhedral.
    High-Performance Transformers for Table Structure Recognition Need Early Convolutions. (arXiv:2311.05565v1 [cs.CV])
    Table structure recognition (TSR) aims to convert tabular images into a machine-readable format, where a visual encoder extracts image features and a textual decoder generates table-representing tokens. Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. However, this hybrid CNN-Transformer architecture introduces a complex visual encoder that accounts for nearly half of the total model parameters, markedly reduces both training and inference speed, and hinders the potential for self-supervised learning in TSR. In this work, we design a lightweight visual encoder for TSR without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model. The convolutional stem strikes an optimal balance between two crucial factors for high-performance TSR: a higher receptive field (RF) ratio and a longer sequence length. This allows it to "see" an appropriate portion of the table and "store" the complex table structure within sufficient context length for the subsequent transformer. We conducted reproducible ablation studies and open-sourced our code at https://github.com/poloclub/tsr-convstem to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain as tables are a promising modality for representation learning.
    Byzantine-Robust Distributed Online Learning: Taming Adversarial Participants in An Adversarial Environment. (arXiv:2307.07980v2 [cs.LG] UPDATED)
    This paper studies distributed online learning under Byzantine attacks. The performance of an online learning algorithm is often characterized by (adversarial) regret, which evaluates the quality of one-step-ahead decision-making when an environment provides adversarial losses, and a sublinear bound is preferred. But we prove that, even with a class of state-of-the-art robust aggregation rules, in an adversarial environment and in the presence of Byzantine participants, distributed online gradient descent can only achieve a linear adversarial regret bound, which is tight. This is the inevitable consequence of Byzantine attacks, even though we can control the constant of the linear adversarial regret to a reasonable level. Interestingly, when the environment is not fully adversarial so that the losses of the honest participants are i.i.d. (independent and identically distributed), we show that sublinear stochastic regret, in contrast to the aforementioned adversarial regret, is possible. We develop a Byzantine-robust distributed online momentum algorithm to attain such a sublinear stochastic regret bound. Extensive numerical experiments corroborate our theoretical analysis.
    Do personality tests generalize to Large Language Models?. (arXiv:2311.05297v1 [cs.CL])
    With large language models (LLMs) appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate various properties of these models using tests originally designed for humans. While re-using existing tests is a resource-efficient way to evaluate LLMs, careful adjustments are usually required to ensure that test results are even valid across human sub-populations. Thus, it is not clear to what extent different tests' validity generalizes to LLMs. In this work, we provide evidence that LLMs' responses to personality tests systematically deviate from typical human responses, implying that these results cannot be interpreted in the same way as human test results. Concretely, reverse-coded items (e.g. "I am introverted" vs "I am extraverted") are often both answered affirmatively by LLMs. In addition, variation across different prompts designed to "steer" LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe it is important to pay more attention to tests' validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs' "personality".
    Personalized Online Federated Learning with Multiple Kernels. (arXiv:2311.05108v1 [cs.LG])
    Multi-kernel learning (MKL) exhibits well-documented performance in online non-linear function approximation. Federated learning enables a group of learners (called clients) to train an MKL model on the data distributed among clients to perform online non-linear function approximation. There are some challenges in online federated MKL that need to be addressed: i) Communication efficiency especially when a large number of kernels are considered ii) Heterogeneous data distribution among clients. The present paper develops an algorithmic framework to enable clients to communicate with the server to send their updates with affordable communication cost while clients employ a large dictionary of kernels. Utilizing random feature (RF) approximation, the present paper proposes scalable online federated MKL algorithm. We prove that using the proposed online federated MKL algorithm, each client enjoys sub-linear regret with respect to the RF approximation of its best kernel in hindsight, which indicates that the proposed algorithm can effectively deal with heterogeneity of the data distributed among clients. Experimental results on real datasets showcase the advantages of the proposed algorithm compared with other online federated kernel learning ones.
    Factorized Discriminant Analysis for Genetic Signatures of Neuronal Phenotypes. (arXiv:2010.02171v6 [q-bio.QM] UPDATED)
    Navigating the complex landscape of single-cell transcriptomic data presents significant challenges. Central to this challenge is the identification of a meaningful representation of high-dimensional gene expression patterns that sheds light on the structural and functional properties of cell types. Pursuing model interpretability and computational simplicity, we often look for a linear transformation of the original data that aligns with key phenotypic features of cells. In response to this need, we introduce factorized linear discriminant analysis (FLDA), a novel method for linear dimensionality reduction. The crux of FLDA lies in identifying a linear function of gene expression levels that is highly correlated with one phenotypic feature while minimizing the influence of others. To augment this method, we integrate it with a sparsity-based regularization algorithm. This integration is crucial as it selects a subset of genes pivotal to a specific phenotypic feature or a combination thereof. To illustrate the effectiveness of FLDA, we apply it to transcriptomic datasets from neurons in the Drosophila optic lobe. We demonstrate that FLDA not only captures the inherent structural patterns aligned with phenotypic features but also uncovers key genes associated with each phenotype.
    Exploiting Neural-Network Statistics for Low-Power DNN Inference. (arXiv:2311.05557v1 [cs.LG])
    Specialized compute blocks have been developed for efficient DNN execution. However, due to the vast amount of data and parameter movements, the interconnects and on-chip memories form another bottleneck, impairing power and performance. This work addresses this bottleneck by contributing a low-power technique for edge-AI inference engines that combines overhead-free coding with a statistical analysis of the data and parameters of neural networks. Our approach reduces the interconnect and memory power consumption by up to 80% for state-of-the-art benchmarks while providing additional power savings for the compute blocks by up to 39%. These power improvements are achieved with no loss of accuracy and negligible hardware cost.
    Accelerating Exploration with Unlabeled Prior Data. (arXiv:2311.05067v1 [cs.LG])
    Learning to solve tasks from a sparse reward signal is a major challenge for standard reinforcement learning (RL) algorithms. However, in the real world, agents rarely need to solve sparse reward tasks entirely from scratch. More often, we might possess prior experience to draw on that provides considerable guidance about which actions and outcomes are possible in the world, which we can use to explore more effectively for new tasks. In this work, we study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task. We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization. This general formula leads to rapid exploration in several challenging sparse-reward domains where tabula rasa exploration is insufficient, including the AntMaze domain, Adroit hand manipulation domain, and a visual simulated robotic manipulation domain. Our results highlight the ease of incorporating unlabeled prior data into existing online RL algorithms, and the (perhaps surprising) effectiveness of doing so.
    On the Consistency of Maximum Likelihood Estimation of Probabilistic Principal Component Analysis. (arXiv:2311.05046v1 [stat.ML])
    Probabilistic principal component analysis (PPCA) is currently one of the most used statistical tools to reduce the ambient dimension of the data. From multidimensional scaling to the imputation of missing data, PPCA has a broad spectrum of applications ranging from science and engineering to quantitative finance. Despite this wide applicability in various fields, hardly any theoretical guarantees exist to justify the soundness of the maximal likelihood (ML) solution for this model. In fact, it is well known that the maximum likelihood estimation (MLE) can only recover the true model parameters up to a rotation. The main obstruction is posed by the inherent identifiability nature of the PPCA model resulting from the rotational symmetry of the parameterization. To resolve this ambiguity, we propose a novel approach using quotient topological spaces and in particular, we show that the maximum likelihood solution is consistent in an appropriate quotient Euclidean space. Furthermore, our consistency results encompass a more general class of estimators beyond the MLE. Strong consistency of the ML estimate and consequently strong covariance estimation of the PPCA model have also been established under a compactness assumption.
    Visible to Thermal image Translation for improving visual task in low light conditions. (arXiv:2310.20190v2 [cs.CV] UPDATED)
    Several visual tasks, such as pedestrian detection and image-to-image translation, are challenging to accomplish in low light using RGB images. Heat variation of objects in thermal images can be used to overcome this. In this work, an end-to-end framework, which consists of a generative network and a detector network, is proposed to translate RGB image into Thermal ones and compare generated thermal images with real data. We have collected images from two different locations using the Parrot Anafi Thermal drone. After that, we created a two-stream network, preprocessed, augmented, the image data, and trained the generator and discriminator models from scratch. The findings demonstrate that it is feasible to translate RGB training data to thermal data using GAN. As a result, thermal data can now be produced more quickly and affordably, which is useful for security and surveillance applications.
    Diffusion Based Causal Representation Learning. (arXiv:2311.05421v1 [cs.LG])
    Causal reasoning can be considered a cornerstone of intelligent systems. Having access to an underlying causal graph comes with the promise of cause-effect estimation and the identification of efficient and safe interventions. However, learning causal representations remains a major challenge, due to the complexity of many real-world systems. Previous works on causal representation learning have mostly focused on Variational Auto-Encoders (VAE). These methods only provide representations from a point estimate, and they are unsuitable to handle high dimensions. To overcome these problems, we proposed a new Diffusion-based Causal Representation Learning (DCRL) algorithm. This algorithm uses diffusion-based representations for causal discovery. DCRL offers access to infinite dimensional latent codes, which encode different levels of information in the latent code. In a first proof of principle, we investigate the use of DCRL for causal representation learning. We further demonstrate experimentally that this approach performs comparably well in identifying the causal structure and causal variables.
    Statistical Learning of Conjunction Data Messages Through a Bayesian Non-Homogeneous Poisson Process. (arXiv:2311.05426v1 [cs.LG])
    Current approaches for collision avoidance and space traffic management face many challenges, mainly due to the continuous increase in the number of objects in orbit and the lack of scalable and automated solutions. To avoid catastrophic incidents, satellite owners/operators must be aware of their assets' collision risk to decide whether a collision avoidance manoeuvre needs to be performed. This process is typically executed through the use of warnings issued in the form of CDMs which contain information about the event, such as the expected TCA and the probability of collision. Our previous work presented a statistical learning model that allowed us to answer two important questions: (1) Will any new conjunctions be issued in the next specified time interval? (2) When and with what uncertainty will the next CDM arrive? However, the model was based on an empirical Bayes homogeneous Poisson process, which assumes that the arrival rates of CDMs are constant over time. In fact, the rate at which the CDMs are issued depends on the behaviour of the objects as well as on the screening process performed by third parties. Thus, in this work, we extend the previous study and propose a Bayesian non-homogeneous Poisson process implemented with high precision using a Probabilistic Programming Language to fully describe the underlying phenomena. We compare the proposed solution with a baseline model to demonstrate the added value of our approach. The results show that this problem can be successfully modelled by our Bayesian non-homogeneous Poisson Process with greater accuracy, contributing to the development of automated collision avoidance systems and helping operators react timely but sparingly with satellite manoeuvres.
    Uncertainty Wrapper in the medical domain: Establishing transparent uncertainty quantification for opaque machine learning models in practice. (arXiv:2311.05245v1 [cs.LG])
    When systems use data-based models that are based on machine learning (ML), errors in their results cannot be ruled out. This is particularly critical if it remains unclear to the user how these models arrived at their decisions and if errors can have safety-relevant consequences, as is often the case in the medical field. In such cases, the use of dependable methods to quantify the uncertainty remaining in a result allows the user to make an informed decision about further usage and draw possible conclusions based on a given result. This paper demonstrates the applicability and practical utility of the Uncertainty Wrapper using flow cytometry as an application from the medical field that can benefit from the use of ML models in conjunction with dependable and transparent uncertainty quantification.
    Vicarious Offense and Noise Audit of Offensive Speech Classifiers: Unifying Human and Machine Disagreement on What is Offensive. (arXiv:2301.12534v4 [cs.CL] UPDATED)
    Offensive speech detection is a key component of content moderation. However, what is offensive can be highly subjective. This paper investigates how machine and human moderators disagree on what is offensive when it comes to real-world social web political discourse. We show that (1) there is extensive disagreement among the moderators (humans and machines); and (2) human and large-language-model classifiers are unable to predict how other human raters will respond, based on their political leanings. For (1), we conduct a noise audit at an unprecedented scale that combines both machine and human responses. For (2), we introduce a first-of-its-kind dataset of vicarious offense. Our noise audit reveals that moderation outcomes vary wildly across different machine moderators. Our experiments with human moderators suggest that political leanings combined with sensitive issues affect both first-person and vicarious offense. The dataset is available through https://github.com/Homan-Lab/voiced.
    Uncoupled and Convergent Learning in Two-Player Zero-Sum Markov Games with Bandit Feedback. (arXiv:2303.02738v2 [cs.GT] UPDATED)
    We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is uncoupled, convergent, and rational, with non-asymptotic convergence rates. We start from the case of stateless matrix game with bandit feedback as a warm-up, showing an $O(t^{-\frac{1}{8}})$ last-iterate convergence rate. To the best of our knowledge, this is the first result that obtains finite last-iterate convergence rate given access to only bandit feedback. We extend our result to the case of irreducible Markov games, providing a last-iterate convergence rate of $O(t^{-\frac{1}{9+\varepsilon}})$ for any $\varepsilon>0$. Finally, we study Markov games without any assumptions on the dynamics, and show a path convergence rate, which is a new notion of convergence we defined, of $O(t^{-\frac{1}{10}})$. Our algorithm removes the coordination and prior knowledge requirement of [Wei et al., 2021], which pursued the same goals as us for irreducible Markov games. Our algorithm is related to [Chen et al., 2021, Cen et al., 2021] and also builds on the entropy regularization technique. However, we remove their requirement of communications on the entropy values, making our algorithm entirely uncoupled.
    On the approximation capability of GNNs in node classification/regression tasks. (arXiv:2106.08992v6 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) are a broad class of connectionist models for graph processing. Recent studies have shown that GNNs can approximate any function on graphs, modulo the equivalence relation on graphs defined by the Weisfeiler--Lehman (WL) test. However, these results suffer from some limitations, both because they were derived using the Stone--Weierstrass theorem -- which is existential in nature, -- and because they assume that the target function to be approximated must be continuous. Furthermore, all current results are dedicated to graph classification/regression tasks, where the GNN must produce a single output for the whole graph, while also node classification/regression problems, in which an output is returned for each node, are very common. In this paper, we propose an alternative way to demonstrate the approximation capability of GNNs that overcomes these limitations. Indeed, we show that GNNs are universal approximators in probability for node classification/regression tasks, as they can approximate any measurable function that satisfies the 1--WL equivalence on nodes. The proposed theoretical framework allows the approximation of generic discontinuous target functions and also suggests the GNN architecture that can reach a desired approximation. In addition, we provide a bound on the number of the GNN layers required to achieve the desired degree of approximation, namely $2r-1$, where $r$ is the maximum number of nodes for the graphs in the domain.
    Accelerated Shapley Value Approximation for Data Evaluation. (arXiv:2311.05346v1 [cs.LG])
    Data valuation has found various applications in machine learning, such as data filtering, efficient learning and incentives for data sharing. The most popular current approach to data valuation is the Shapley value. While popular for its various applications, Shapley value is computationally expensive even to approximate, as it requires repeated iterations of training models on different subsets of data. In this paper we show that the Shapley value of data points can be approximated more efficiently by leveraging the structural properties of machine learning problems. We derive convergence guarantees on the accuracy of the approximate Shapley value for different learning settings including Stochastic Gradient Descent with convex and non-convex loss functions. Our analysis suggests that in fact models trained on small subsets are more important in the context of data valuation. Based on this idea, we describe $\delta$-Shapley -- a strategy of only using small subsets for the approximation. Experiments show that this approach preserves approximate value and rank of data, while achieving speedup of up to 9.9x. In pre-trained networks the approach is found to bring more efficiency in terms of accurate evaluation using small subsets.
    An Efficient Federated Learning Framework for Training Semantic Communication System. (arXiv:2310.13236v2 [cs.LG] UPDATED)
    Semantic communication has emerged as a pillar for the next generation of communication systems due to its capabilities in alleviating data redundancy. Most semantic communication systems are built upon advanced deep learning models whose training performance heavily relies on data availability. Existing studies often make unrealistic assumptions of a readily accessible data source, where in practice, data is mainly created on the client side. Due to privacy and security concerns, the transmission of data is restricted, which is necessary for conventional centralized training schemes. To address this challenge, we explore semantic communication in a federated learning (FL) setting that utilizes client data without leaking privacy. Additionally, we design our system to tackle the communication overhead by reducing the quantity of information delivered in each global round. In this way, we can save significant bandwidth for resource-limited devices and reduce overall network traffic. Finally, we introduce a mechanism to aggregate the global model from clients, called FedLol. Extensive simulation results demonstrate the effectiveness of our proposed technique compared to baseline methods.
    Whisper in Focus: Enhancing Stuttered Speech Classification with Encoder Layer Optimization. (arXiv:2311.05203v1 [cs.SD])
    In recent years, advancements in the field of speech processing have led to cutting-edge deep learning algorithms with immense potential for real-world applications. The automated identification of stuttered speech is one of such applications that the researchers are addressing by employing deep learning techniques. Recently, researchers have utilized Wav2vec2.0, a speech recognition model to classify disfluency types in stuttered speech. Although Wav2vec2.0 has shown commendable results, its ability to generalize across all disfluency types is limited. In addition, since its base model uses 12 encoder layers, it is considered a resource-intensive model. Our study unravels the capabilities of Whisper for the classification of disfluency types in stuttered speech. We have made notable contributions in three pivotal areas: enhancing the quality of SEP28-k benchmark dataset, exploration of Whisper for classification, and introducing an efficient encoder layer freezing strategy. The optimized Whisper model has achieved the average F1-score of 0.81, which proffers its abilities. This study also unwinds the significance of deeper encoder layers in the identification of disfluency types, as the results demonstrate their greater contribution compared to initial layers. This research represents substantial contributions, shifting the emphasis towards an efficient solution, thereby thriving towards prospective innovation.
    Exploring and Analyzing Wildland Fire Data Via Machine Learning Techniques. (arXiv:2311.05128v1 [cs.LG])
    This research project investigated the correlation between a 10 Hz time series of thermocouple temperatures and turbulent kinetic energy (TKE) computed from wind speeds collected from a small experimental prescribed burn at the Silas Little Experimental Forest in New Jersey, USA. The primary objective of this project was to explore the potential for using thermocouple temperatures as predictors for estimating the TKE produced by a wildland fire. Machine learning models, including Deep Neural Networks, Random Forest Regressor, Gradient Boosting, and Gaussian Process Regressor, are employed to assess the potential for thermocouple temperature perturbations to predict TKE values. Data visualization and correlation analyses reveal patterns and relationships between thermocouple temperatures and TKE, providing insight into the underlying dynamics. The project achieves high accuracy in predicting TKE by employing various machine learning models despite a weak correlation between the predictors and the target variable. The results demonstrate significant success, particularly from regression models, in accurately estimating the TKE. The research findings contribute to fire behavior and smoke modeling science, emphasizing the importance of incorporating machine learning approaches and identifying complex relationships between fine-scale fire behavior and turbulence. Accurate TKE estimation using thermocouple temperatures allows for the refinement of models that can inform decision-making in fire management strategies, facilitate effective risk mitigation, and optimize fire management efforts. This project highlights the valuable role of machine learning techniques in analyzing wildland fire data, showcasing their potential to advance fire research and management practices.
    Gaussian Cooling and Dikin Walks: The Interior-Point Method for Logconcave Sampling. (arXiv:2307.12943v3 [cs.DS] UPDATED)
    The connections between (convex) optimization and (logconcave) sampling have been considerably enriched in the past decade with many conceptual and mathematical analogies. For instance, the Langevin algorithm can be viewed as a sampling analogue of gradient descent and has condition-number-dependent guarantees on its performance. In the early 1990s, Nesterov and Nemirovski developed the Interior-Point Method (IPM) for convex optimization based on self-concordant barriers, providing efficient algorithms for structured convex optimization, often faster than the general method. This raises the following question: can we develop an analogous IPM for structured sampling problems? In 2012, Kannan and Narayanan proposed the Dikin walk for uniformly sampling polytopes, and an improved analysis was given in 2020 by Laddha-Lee-Vempala. The Dikin walk uses a local metric defined by a self-concordant barrier for linear constraints. Here we generalize this approach by developing and adapting IPM machinery together with the Dikin walk for poly-time sampling algorithms. Our IPM-based sampling framework provides an efficient warm start and goes beyond uniform distributions and linear constraints. We illustrate the approach on important special cases, in particular giving the fastest algorithms to sample uniform, exponential, or Gaussian distributions on a truncated PSD cone. The framework is general and can be applied to other sampling algorithms.
    Adapting Contrastive Language-Image Pretrained (CLIP) Models for Out-of-Distribution Detection. (arXiv:2303.05828v2 [cs.CV] UPDATED)
    We present a comprehensive experimental study on pretrained feature extractors for visual out-of-distribution (OOD) detection, focusing on adapting contrastive language-image pretrained (CLIP) models. Without fine-tuning on the training data, we are able to establish a positive correlation ($R^2\geq0.92$) between in-distribution classification and unsupervised OOD detection for CLIP models in $4$ benchmarks. We further propose a new simple and scalable method called \textit{pseudo-label probing} (PLP) that adapts vision-language models for OOD detection. Given a set of label names of the training set, PLP trains a linear layer using the pseudo-labels derived from the text encoder of CLIP. To test the OOD detection robustness of pretrained models, we develop a novel feature-based adversarial OOD data manipulation approach to create adversarial samples. Intriguingly, we show that (i) PLP outperforms the previous state-of-the-art \citep{ming2022mcm} on all $5$ large-scale benchmarks based on ImageNet, specifically by an average AUROC gain of 3.4\% using the largest CLIP model (ViT-G), (ii) we show that linear probing outperforms fine-tuning by large margins for CLIP architectures (i.e. CLIP ViT-H achieves a mean gain of 7.3\% AUROC on average on all ImageNet-based benchmarks), and (iii) billion-parameter CLIP models still fail at detecting adversarially manipulated OOD images. The code and adversarially created datasets will be made publicly available.
    On the Behavior of Audio-Visual Fusion Architectures in Identity Verification Tasks. (arXiv:2311.05071v1 [cs.LG])
    We train an identity verification architecture and evaluate modifications to the part of the model that combines audio and visual representations, including in scenarios where one input is missing in either of two examples to be compared. We report results on the Voxceleb1-E test set that suggest averaging the output embeddings improves error rate in the full-modality setting and when a single modality is missing, and makes more complete use of the embedding space than systems which use shared layers and discuss possible reasons for this behavior.
    An Interdisciplinary Outlook on Large Language Models for Scientific Research. (arXiv:2311.04929v1 [cs.CL])
    In this paper, we describe the capabilities and constraints of Large Language Models (LLMs) within disparate academic disciplines, aiming to delineate their strengths and limitations with precision. We examine how LLMs augment scientific inquiry, offering concrete examples such as accelerating literature review by summarizing vast numbers of publications, enhancing code development through automated syntax correction, and refining the scientific writing process. Simultaneously, we articulate the challenges LLMs face, including their reliance on extensive and sometimes biased datasets, and the potential ethical dilemmas stemming from their use. Our critical discussion extends to the varying impacts of LLMs across fields, from the natural sciences, where they help model complex biological sequences, to the social sciences, where they can parse large-scale qualitative data. We conclude by offering a nuanced perspective on how LLMs can be both a boon and a boundary to scientific progress.
    Sorting Out Quantum Monte Carlo. (arXiv:2311.05598v1 [cs.LG])
    Molecular modeling at the quantum level requires choosing a parameterization of the wavefunction that both respects the required particle symmetries, and is scalable to systems of many particles. For the simulation of fermions, valid parameterizations must be antisymmetric with respect to the exchange of particles. Typically, antisymmetry is enforced by leveraging the anti-symmetry of determinants with respect to the exchange of matrix rows, but this involves computing a full determinant each time the wavefunction is evaluated. Instead, we introduce a new antisymmetrization layer derived from sorting, the $\textit{sortlet}$, which scales as $O(N \log N)$ with regards to the number of particles -- in contrast to $O(N^3)$ for the determinant. We show numerically that applying this anti-symmeterization layer on top of an attention based neural-network backbone yields a flexible wavefunction parameterization capable of reaching chemical accuracy when approximating the ground state of first-row atoms and small molecules.
    Multi-modal Graph Learning over UMLS Knowledge Graphs. (arXiv:2307.04461v2 [cs.LG] UPDATED)
    Clinicians are increasingly looking towards machine learning to gain insights about patient evolutions. We propose a novel approach named Multi-Modal UMLS Graph Learning (MMUGL) for learning meaningful representations of medical concepts using graph neural networks over knowledge graphs based on the unified medical language system. These representations are aggregated to represent entire patient visits and then fed into a sequence model to perform predictions at the granularity of multiple hospital visits of a patient. We improve performance by incorporating prior medical knowledge and considering multiple modalities. We compare our method to existing architectures proposed to learn representations at different granularities on the MIMIC-III dataset and show that our approach outperforms these methods. The results demonstrate the significance of multi-modal medical concept representations based on prior medical knowledge.
    Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations. (arXiv:2311.05584v1 [cs.LG])
    Large language models (LLMs) have emerged as powerful and general solutions to many natural language tasks. However, many of the most important applications of language generation are interactive, where an agent has to talk to a person to reach a desired outcome. For example, a teacher might try to understand their student's current comprehension level to tailor their instruction accordingly, and a travel agent might ask questions of their customer to understand their preferences in order to recommend activities they might enjoy. LLMs trained with supervised fine-tuning or "single-step" RL, as with standard RLHF, might struggle which tasks that require such goal-directed behavior, since they are not trained to optimize for overall conversational outcomes after multiple turns of interaction. In this work, we explore a new method for adapting LLMs with RL for such goal-directed dialogue. Our key insight is that, though LLMs might not effectively solve goal-directed dialogue tasks out of the box, they can provide useful data for solving such tasks by simulating suboptimal but human-like behaviors. Given a textual description of a goal-directed dialogue task, we leverage LLMs to sample diverse synthetic rollouts of hypothetical in-domain human-human interactions. Our algorithm then utilizes this dataset with offline reinforcement learning to train an interactive conversational agent that can optimize goal-directed objectives over multiple turns. In effect, the LLM produces examples of possible interactions, and RL then processes these examples to learn to perform more optimal interactions. Empirically, we show that our proposed approach achieves state-of-the-art performance in various goal-directed dialogue tasks that include teaching and preference elicitation.
    Multimodal Clinical Benchmark for Emergency Care (MC-BEC): A Comprehensive Benchmark for Evaluating Foundation Models in Emergency Medicine. (arXiv:2311.04937v1 [cs.LG])
    We propose the Multimodal Clinical Benchmark for Emergency Care (MC-BEC), a comprehensive benchmark for evaluating foundation models in Emergency Medicine using a dataset of 100K+ continuously monitored Emergency Department visits from 2020-2022. MC-BEC focuses on clinically relevant prediction tasks at timescales from minutes to days, including predicting patient decompensation, disposition, and emergency department (ED) revisit, and includes a standardized evaluation framework with train-test splits and evaluation metrics. The multimodal dataset includes a wide range of detailed clinical data, including triage information, prior diagnoses and medications, continuously measured vital signs, electrocardiogram and photoplethysmograph waveforms, orders placed and medications administered throughout the visit, free-text reports of imaging studies, and information on ED diagnosis, disposition, and subsequent revisits. We provide performance baselines for each prediction task to enable the evaluation of multimodal, multitask models. We believe that MC-BEC will encourage researchers to develop more effective, generalizable, and accessible foundation models for multimodal clinical data.
    Active Transfer Learning for Efficient Video-Specific Human Pose Estimation. (arXiv:2311.05041v1 [cs.CV])
    Human Pose (HP) estimation is actively researched because of its wide range of applications. However, even estimators pre-trained on large datasets may not perform satisfactorily due to a domain gap between the training and test data. To address this issue, we present our approach combining Active Learning (AL) and Transfer Learning (TL) to adapt HP estimators to individual video domains efficiently. For efficient learning, our approach quantifies (i) the estimation uncertainty based on the temporal changes in the estimated heatmaps and (ii) the unnaturalness in the estimated full-body HPs. These quantified criteria are then effectively combined with the state-of-the-art representativeness criterion to select uncertain and diverse samples for efficient HP estimator learning. Furthermore, we reconsider the existing Active Transfer Learning (ATL) method to introduce novel ideas related to the retraining methods and Stopping Criteria (SC). Experimental results demonstrate that our method enhances learning efficiency and outperforms comparative methods. Our code is publicly available at: https://github.com/ImIntheMiddle/VATL4Pose-WACV2024
    Expressibility-induced Concentration of Quantum Neural Tangent Kernels. (arXiv:2311.04965v1 [quant-ph])
    Quantum tangent kernel methods provide an efficient approach to analyzing the performance of quantum machine learning models in the infinite-width limit, which is of crucial importance in designing appropriate circuit architectures for certain learning tasks. Recently, they have been adapted to describe the convergence rate of training errors in quantum neural networks in an analytical manner. Here, we study the connections between the trainability and expressibility of quantum tangent kernel models. In particular, for global loss functions, we rigorously prove that high expressibility of both the global and local quantum encodings can lead to exponential concentration of quantum tangent kernel values to zero. Whereas for local loss functions, such issue of exponential concentration persists owing to the high expressibility, but can be partially mitigated. We further carry out extensive numerical simulations to support our analytical theories. Our discoveries unveil a pivotal characteristic of quantum neural tangent kernels, offering valuable insights for the design of wide quantum variational circuit models in practical applications.
    Improved DDIM Sampling with Moment Matching Gaussian Mixtures. (arXiv:2311.04938v1 [cs.CV])
    We propose using a Gaussian Mixture Model (GMM) as reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ and class-conditional models trained on ImageNet datasets respectively. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example on ImageNet 256x256, using 10 sampling steps, we achieve a FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel.
    Price Interpretability of Prediction Markets: A Convergence Analysis. (arXiv:2205.08913v2 [q-fin.TR] UPDATED)
    Prediction markets are long known for prediction accuracy. This study systematically explores the fundamental properties of prediction markets, addressing questions about their information aggregation process and the factors contributing to their remarkable efficacy. We propose a novel multivariate utility (MU) based mechanism that unifies several existing automated market-making schemes. Using this mechanism, we establish the convergence results for markets comprised of risk-averse traders who have heterogeneous beliefs and repeatedly interact with the market maker. We demonstrate that the resulting limiting wealth distribution aligns with the Pareto efficient frontier defined by the utilities of all market participants. With the help of this result, we establish analytical and numerical results for the limiting price in different market models. Specifically, we show that the limiting price converges to the geometric mean of agent beliefs in exponential utility-based markets. In risk-measure-based markets, we construct a family of risk measures that satisfy the convergence criteria and prove that the price can converge to a unique level represented by the weighted power mean of agent beliefs. In broader markets with Constant Relative Risk Aversion (CRRA) utilities, we reveal that the limiting price can be characterized by systems of equations that encapsulate agent beliefs, risk parameters, and wealth. Despite the potential impact of traders' trading sequences on the limiting price, we establish a price invariance result for markets with a large trader population. Using this result, we propose an efficient approximation scheme for the limiting price.
    Beyond the training set: an intuitive method for detecting distribution shift in model-based optimization. (arXiv:2311.05363v1 [cs.LG])
    Model-based optimization (MBO) is increasingly applied to design problems in science and engineering. A common scenario involves using a fixed training set to train models, with the goal of designing new samples that outperform those present in the training data. A major challenge in this setting is distribution shift, where the distributions of training and design samples are different. While some shift is expected, as the goal is to create better designs, this change can negatively affect model accuracy and subsequently, design quality. Despite the widespread nature of this problem, addressing it demands deep domain knowledge and artful application. To tackle this issue, we propose a straightforward method for design practitioners that detects distribution shifts. This method trains a binary classifier using knowledge of the unlabeled design distribution to separate the training data from the design data. The classifier's logit scores are then used as a proxy measure of distribution shift. We validate our method in a real-world application by running offline MBO and evaluate the effect of distribution shift on design quality. We find that the intensity of the shift in the design distribution varies based on the number of steps taken by the optimization algorithm, and our simple approach can identify these shifts. This enables users to constrain their search to regions where the model's predictions are reliable, thereby increasing the quality of designs.
    Reliable and Efficient Data Collection in UAV-based IoT Networks. (arXiv:2311.05303v1 [cs.NI])
    Internet of Things (IoT) involves sensors for monitoring and wireless networks for efficient communication. However, resource-constrained IoT devices and limitations in existing wireless technologies hinder its full potential. Integrating Unmanned Aerial Vehicles (UAVs) into IoT networks can address some challenges by expanding its' coverage, providing security, and bringing computing closer to IoT devices. Nevertheless, effective data collection in UAV-assisted IoT networks is hampered by factors, including dynamic UAV behavior, environmental variables, connectivity instability, and security considerations. In this survey, we first explore UAV-based IoT networks, focusing on communication and networking aspects. Next, we cover various UAV-based data collection methods their advantages and disadvantages, followed by a discussion on performance metrics for data collection. As this article primarily emphasizes reliable and efficient data collection in UAV-assisted IoT networks, we briefly discuss existing research on data accuracy and consistency, network connectivity, and data security and privacy to provide insights into reliable data collection. Additionally, we discuss efficient data collection strategies in UAV-based IoT networks, covering trajectory and path planning, collision avoidance, sensor network clustering, data aggregation, UAV swarm formations, and artificial intelligence for optimization. We also present two use cases of UAVs as a service for enhancing data collection reliability and efficiency. Finally, we discuss future challenges in data collection for UAV-assisted IoT networks.
    DEMASQ: Unmasking the ChatGPT Wordsmith. (arXiv:2311.05019v1 [cs.CR])
    The potential misuse of ChatGPT and other Large Language Models (LLMs) has raised concerns regarding the dissemination of false information, plagiarism, academic dishonesty, and fraudulent activities. Consequently, distinguishing between AI-generated and human-generated content has emerged as an intriguing research topic. However, current text detection methods lack precision and are often restricted to specific tasks or domains, making them inadequate for identifying content generated by ChatGPT. In this paper, we propose an effective ChatGPT detector named DEMASQ, which accurately identifies ChatGPT-generated content. Our method addresses two critical factors: (i) the distinct biases in text composition observed in human- and machine-generated content and (ii) the alterations made by humans to evade previous detection methods. DEMASQ is an energy-based detection model that incorporates novel aspects, such as (i) optimization inspired by the Doppler effect to capture the interdependence between input text embeddings and output labels, and (ii) the use of explainable AI techniques to generate diverse perturbations. To evaluate our detector, we create a benchmark dataset comprising a mixture of prompts from both ChatGPT and humans, encompassing domains such as medical, open Q&A, finance, wiki, and Reddit. Our evaluation demonstrates that DEMASQ achieves high accuracy in identifying content generated by ChatGPT.
    PLEX: Making the Most of the Available Data for Robotic Manipulation Pretraining. (arXiv:2303.08789v2 [cs.RO] UPDATED)
    A rich representation is key to general robotic manipulation, but existing approaches to representation learning require large amounts of multimodal demonstrations. In this work we propose PLEX, a transformer-based architecture that learns from a small amount of task-agnostic visuomotor trajectories and a much larger amount of task-conditioned object manipulation videos -- a type of data available in quantity. PLEX uses visuomotor trajectories to induce a latent feature space and to learn task-agnostic manipulation routines, while diverse video-only demonstrations teach PLEX how to plan in the induced latent feature space for a wide variety of tasks. Experiments showcase PLEX's generalization on Meta-World and SOTA performance in challenging Robosuite environments. In particular, using relative positional encoding in PLEX's transformers greatly helps in low-data regimes of learning from human-collected demonstrations. The paper's accompanying code and data are available at https://microsoft.github.io/PLEX.
    Perfectly Accurate Membership Inference by a Dishonest Central Server in Federated Learning. (arXiv:2203.16463v2 [cs.LG] UPDATED)
    Federated Learning is expected to provide strong privacy guarantees, as only gradients or model parameters but no plain text training data is ever exchanged either between the clients or between the clients and the central server. In this paper, we challenge this claim by introducing a simple but still very effective membership inference attack algorithm, which relies only on a single training step. In contrast to the popular honest-but-curious model, we investigate a framework with a dishonest central server. Our strategy is applicable to models with ReLU activations and uses the properties of this activation function to achieve perfect accuracy. Empirical evaluation on visual classification tasks with MNIST, CIFAR10, CIFAR100 and CelebA datasets show that our method provides perfect accuracy in identifying one sample in a training set with thousands of samples. Occasional failures of our method lead us to discover duplicate images in the CIFAR100 and CelebA datasets.  ( 2 min )
    Counterfactually Fair Representation. (arXiv:2311.05420v1 [cs.LG])
    The use of machine learning models in high-stake applications (e.g., healthcare, lending, college admission) has raised growing concerns due to potential biases against protected social groups. Various fairness notions and methods have been proposed to mitigate such biases. In this work, we focus on Counterfactual Fairness (CF), a fairness notion that is dependent on an underlying causal graph and first proposed by Kusner \textit{et al.}~\cite{kusner2017counterfactual}; it requires that the outcome an individual perceives is the same in the real world as it would be in a "counterfactual" world, in which the individual belongs to another social group. Learning fair models satisfying CF can be challenging. It was shown in \cite{kusner2017counterfactual} that a sufficient condition for satisfying CF is to \textbf{not} use features that are descendants of sensitive attributes in the causal graph. This implies a simple method that learns CF models only using non-descendants of sensitive attributes while eliminating all descendants. Although several subsequent works proposed methods that use all features for training CF models, there is no theoretical guarantee that they can satisfy CF. In contrast, this work proposes a new algorithm that trains models using all the available features. We theoretically and empirically show that models trained with this method can satisfy CF\footnote{The code repository for this work can be found in \url{https://github.com/osu-srml/CF_Representation_Learning}}.
    GeoFormer: Predicting Human Mobility using Generative Pre-trained Transformer (GPT). (arXiv:2311.05092v1 [cs.LG])
    Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility. Our proposed model is rigorously tested in the context of the HuMob Challenge 2023 -- a competition designed to evaluate the performance of prediction models on standardized datasets to predict human mobility. The challenge leverages two datasets encompassing urban-scale data of 25,000 and 100,000 individuals over a longitudinal period of 75 days. GeoFormer stands out as a top performer in the competition, securing a place in the top-3 ranking. Its success is underscored by performing well on both performance metrics chosen for the competition -- the GEO-BLEU and the Dynamic Time Warping (DTW) measures. The performance of the GeoFormer on the HuMob Challenge 2023 underscores its potential to make substantial contributions to the field of human mobility prediction, with far-reaching implications for disaster preparedness, epidemic control, and beyond.
    Social Media Bot Detection using Dropout-GAN. (arXiv:2311.05079v1 [cs.LG])
    Bot activity on social media platforms is a pervasive problem, undermining the credibility of online discourse and potentially leading to cybercrime. We propose an approach to bot detection using Generative Adversarial Networks (GAN). We discuss how we overcome the issue of mode collapse by utilizing multiple discriminators to train against one generator, while decoupling the discriminator to perform social media bot detection and utilizing the generator for data augmentation. In terms of classification accuracy, our approach outperforms the state-of-the-art techniques in this field. We also show how the generator in the GAN can be used to evade such a classification technique.
    Real-time Addressee Estimation: Deployment of a Deep-Learning Model on the iCub Robot. (arXiv:2311.05334v1 [cs.RO])
    Addressee Estimation is the ability to understand to whom a person is talking, a skill essential for social robots to interact smoothly with humans. In this sense, it is one of the problems that must be tackled to develop effective conversational agents in multi-party and unstructured scenarios. As humans, one of the channels that mainly lead us to such estimation is the non-verbal behavior of speakers: first of all, their gaze and body pose. Inspired by human perceptual skills, in the present work, a deep-learning model for Addressee Estimation relying on these two non-verbal features is designed, trained, and deployed on an iCub robot. The study presents the procedure of such implementation and the performance of the model deployed in real-time human-robot interaction compared to previous tests on the dataset used for the training.
    Auto deep learning for bioacoustic signals. (arXiv:2311.04945v1 [cs.LG])
    This study investigates the potential of automated deep learning to enhance the accuracy and efficiency of multi-class classification of bird vocalizations, compared against traditional manually-designed deep learning models. Using the Western Mediterranean Wetland Birds dataset, we investigated the use of AutoKeras, an automated machine learning framework, to automate neural architecture search and hyperparameter tuning. Comparative analysis validates our hypothesis that the AutoKeras-derived model consistently outperforms traditional models like MobileNet, ResNet50 and VGG16. Our approach and findings underscore the transformative potential of automated deep learning for advancing bioacoustics research and models. In fact, the automated techniques eliminate the need for manual feature engineering and model design while improving performance. This study illuminates best practices in sampling, evaluation and reporting to enhance reproducibility in this nascent field. All the code used is available at https: //github.com/giuliotosato/AutoKeras-bioacustic Keywords: AutoKeras; automated deep learning; audio classification; Wetlands Bird dataset; comparative analysis; bioacoustics; validation dataset; multi-class classification; spectrograms.
    Counterfactually Comparing Abstaining Classifiers. (arXiv:2305.10564v2 [stat.ML] UPDATED)
    Abstaining classifiers have the option to abstain from making predictions on inputs that they are unsure about. These classifiers are becoming increasingly popular in high-stakes decision-making problems, as they can withhold uncertain predictions to improve their reliability and safety. When evaluating black-box abstaining classifier(s), however, we lack a principled approach that accounts for what the classifier would have predicted on its abstentions. These missing predictions matter when they can eventually be utilized, either directly or as a backup option in a failure mode. In this paper, we introduce a novel approach and perspective to the problem of evaluating and comparing abstaining classifiers by treating abstentions as missing data. Our evaluation approach is centered around defining the counterfactual score of an abstaining classifier, defined as the expected performance of the classifier had it not been allowed to abstain. We specify the conditions under which the counterfactual score is identifiable: if the abstentions are stochastic, and if the evaluation data is independent of the training data (ensuring that the predictions are missing at random), then the score is identifiable. Note that, if abstentions are deterministic, then the score is unidentifiable because the classifier can perform arbitrarily poorly on its abstentions. Leveraging tools from observational causal inference, we then develop nonparametric and doubly robust methods to efficiently estimate this quantity under identification. Our approach is examined in both simulated and real data experiments.  ( 3 min )
    Is ChatGPT a game changer for geocoding -- a benchmark for geocoding address parsing techniques. (arXiv:2310.14360v2 [cs.CL] UPDATED)
    The remarkable success of GPT models across various tasks, including toponymy recognition motivates us to assess the performance of the GPT-3 model in the geocoding address parsing task. To ensure that the evaluation more accurately mirrors performance in real-world scenarios with diverse user input qualities and resolve the pressing need for a 'gold standard' evaluation dataset for geocoding systems, we introduce a benchmark dataset of low-quality address descriptions synthesized based on human input patterns mining from actual input logs of a geocoding system in production. This dataset has 21 different input errors and variations; contains over 239,000 address records that are uniquely selected from streets across all U.S. 50 states and D.C.; and consists of three subsets to be used as training, validation, and testing sets. Building on this, we train and gauge the performance of the GPT-3 model in extracting address components, contrasting its performance with transformer-based and LSTM-based models. The evaluation results indicate that Bidirectional LSTM-CRF model has achieved the best performance over these transformer-based models and GPT-3 model. Transformer-based models demonstrate very comparable results compared to the Bidirectional LSTM-CRF model. The GPT-3 model, though trailing in performance, showcases potential in the address parsing task with few-shot examples, exhibiting room for improvement with additional fine-tuning. We open source the code and data of this presented benchmark so that researchers can utilize it for future model development or extend it to evaluate similar tasks, such as document geocoding.  ( 3 min )
    Identifiability Guarantees for Causal Disentanglement from Soft Interventions. (arXiv:2307.06250v3 [stat.ML] UPDATED)
    Causal disentanglement aims to uncover a representation of data using latent variables that are interrelated through a causal model. Such a representation is identifiable if the latent model that explains the data is unique. In this paper, we focus on the scenario where unpaired observational and interventional data are available, with each intervention changing the mechanism of a latent variable. When the causal variables are fully observed, statistically consistent algorithms have been developed to identify the causal model under faithfulness assumptions. We here show that identifiability can still be achieved with unobserved causal variables, given a generalized notion of faithfulness. Our results guarantee that we can recover the latent causal model up to an equivalence class and predict the effect of unseen combinations of interventions, in the limit of infinite data. We implement our causal disentanglement framework by developing an autoencoding variational Bayes algorithm and apply it to the problem of predicting combinatorial perturbation effects in genomics.  ( 2 min )
    Operator Splitting for Learning to Predict Equilibria in Convex Games. (arXiv:2106.00906v3 [cs.LG] UPDATED)
    Systems of competing agents can often be modeled as games. Assuming rationality, the most likely outcomes are given by an equilibrium (e.g. a Nash equilibrium). In many practical settings, games are influenced by context, i.e. additional data beyond the control of any agent (e.g. weather for traffic and fiscal policy for market economies). Often the exact game mechanics are unknown, yet vast amounts of historical data consisting of (context, equilibrium) pairs are available, raising the possibility of learning a solver which predicts the equilibria given only the context. We introduce Nash Fixed Point Networks (N-FPNs), a class of neural networks that naturally output equilibria. Crucially, N- FPNs employ a constraint decoupling scheme to handle complicated agent action sets while avoiding expensive projections. Empirically, we find N-FPNs are compatible with the recently developed Jacobian-Free Backpropagation technique for training implicit networks, making them significantly faster and easier to train than prior models. Our experiments show N-FPNs are capable of scaling to problems orders of magnitude larger than existing learned game solvers.  ( 2 min )
    Decentralized SGD and Average-direction SAM are Asymptotically Equivalent. (arXiv:2306.02913v5 [cs.LG] UPDATED)
    Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction Sharpness-aware minimization (SAM) algorithm under general non-convex non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios. The code is available at https://github.com/Raiden-Zhu/ICML-2023-DSGD-and-SAM.  ( 2 min )
    Training shallow ReLU networks on noisy data using hinge loss: when do we overfit and is it benign?. (arXiv:2306.09955v2 [cs.LG] UPDATED)
    We study benign overfitting in two-layer ReLU networks trained using gradient descent and hinge loss on noisy data for binary classification. In particular, we consider linearly separable data for which a relatively small proportion of labels are corrupted or flipped. We identify conditions on the margin of the clean data that give rise to three distinct training outcomes: benign overfitting, in which zero loss is achieved and with high probability test data is classified correctly; overfitting, in which zero loss is achieved but test data is misclassified with probability lower bounded by a constant; and non-overfitting, in which clean points, but not corrupt points, achieve zero loss and again with high probability test data is classified correctly. Our analysis provides a fine-grained description of the dynamics of neurons throughout training and reveals two distinct phases: in the first phase clean points achieve close to zero loss, in the second phase clean points oscillate on the boundary of zero loss while corrupt points either converge towards zero loss or are eventually zeroed by the network. We prove these results using a combinatorial approach that involves bounding the number of clean versus corrupt updates across these phases of training.  ( 3 min )
    Anytime-Constrained Reinforcement Learning. (arXiv:2311.05511v1 [cs.LG])
    We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints. An anytime constraint requires the agent to never violate its budget at any point in time, almost surely. Although Markovian policies are no longer sufficient, we show that there exist optimal deterministic policies augmented with cumulative costs. In fact, we present a fixed-parameter tractable reduction from anytime-constrained cMDPs to unconstrained MDPs. Our reduction yields planning and learning algorithms that are time and sample-efficient for tabular cMDPs so long as the precision of the costs is logarithmic in the size of the cMDP. However, we also show that computing non-trivial approximately optimal policies is NP-hard in general. To circumvent this bottleneck, we design provable approximation algorithms that efficiently compute or learn an approximately feasible policy with optimal value so long as the maximum supported cost is bounded by a polynomial in the cMDP or by the absolute budget. Given our hardness results, our approximation guarantees are the best possible in terms of tractability under worst-case analysis.  ( 2 min )
    Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks. (arXiv:2311.05152v1 [cs.LG])
    In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specific information during encoding, which adversely affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.  ( 2 min )
    Embedding Space Interpolation Beyond Mini-Batch, Beyond Pairs and Beyond Examples. (arXiv:2311.05538v1 [cs.LG])
    Mixup refers to interpolation-based data augmentation, originally motivated as a way to go beyond empirical risk minimization (ERM). Its extensions mostly focus on the definition of interpolation and the space (input or feature) where it takes place, while the augmentation process itself is less studied. In most methods, the number of generated examples is limited to the mini-batch size and the number of examples being interpolated is limited to two (pairs), in the input space. We make progress in this direction by introducing MultiMix, which generates an arbitrarily large number of interpolated examples beyond the mini-batch size and interpolates the entire mini-batch in the embedding space. Effectively, we sample on the entire convex hull of the mini-batch rather than along linear segments between pairs of examples. On sequence data, we further extend to Dense MultiMix. We densely interpolate features and target labels at each spatial location and also apply the loss densely. To mitigate the lack of dense labels, we inherit labels from examples and weight interpolation factors by attention as a measure of confidence. Overall, we increase the number of loss terms per mini-batch by orders of magnitude at little additional cost. This is only possible because of interpolating in the embedding space. We empirically show that our solutions yield significant improvement over state-of-the-art mixup methods on four different benchmarks, despite interpolation being only linear. By analyzing the embedding space, we show that the classes are more tightly clustered and uniformly spread over the embedding space, thereby explaining the improved behavior.  ( 3 min )
    Automatically Score Tissue Images Like a Pathologist by Transfer Learning. (arXiv:2209.05954v3 [cs.LG] UPDATED)
    Cancer is the second leading cause of death in the world. Diagnosing cancer early on can save many lives. Pathologists have to look at tissue microarray (TMA) images manually to identify tumors, which can be time-consuming, inconsistent and subjective. Existing automatic algorithms either have not achieved the accuracy level of a pathologist or require substantial human involvements. A major challenge is that TMA images with different shapes, sizes, and locations can have the same score. Learning staining patterns in TMA images requires a huge number of images, which are severely limited due to privacy and regulation concerns in medical organizations. TMA images from different cancer types may share certain common characteristics, but combining them directly harms the accuracy due to heterogeneity in their staining patterns. Transfer learning is an emerging learning paradigm that allows borrowing strength from similar problems. However, existing approaches typically require a large sample from similar learning problems, while TMA images of different cancer types are often available in small sample size and further existing algorithms are limited to transfer learning from one similar problem. We propose a new transfer learning algorithm that could learn from multiple related problems, where each problem has a small sample and can have a substantially different distribution from the original one. The proposed algorithm has made it possible to break the critical accuracy barrier (the 75% accuracy level of pathologists), with a reported accuracy of 75.9% on breast cancer TMA images from the Stanford Tissue Microarray Database. It is supported by recent developments in transfer learning theory and empirical evidence in clustering technology. This will allow pathologists to confidently adopt automatic algorithms in recognizing tumors consistently with a higher accuracy in real time.  ( 3 min )
    Intervention Generalization: A View from Factor Graph Models. (arXiv:2306.04027v2 [stat.ML] UPDATED)
    One of the goals of causal inference is to generalize from past experiments and observational data to novel conditions. While it is in principle possible to eventually learn a mapping from a novel experimental condition to an outcome of interest, provided a sufficient variety of experiments is available in the training data, coping with a large combinatorial space of possible interventions is hard. Under a typical sparse experimental design, this mapping is ill-posed without relying on heavy regularization or prior distributions. Such assumptions may or may not be reliable, and can be hard to defend or test. In this paper, we take a close look at how to warrant a leap from past experiments to novel conditions based on minimal assumptions about the factorization of the distribution of the manipulated system, communicated in the well-understood language of factor graph models. A postulated $\textit{interventional factor model}$ (IFM) may not always be informative, but it conveniently abstracts away a need for explicitly modeling unmeasured confounding and feedback mechanisms, leading to directly testable claims. Given an IFM and datasets from a collection of experimental regimes, we derive conditions for identifiability of the expected outcomes of new regimes never observed in these training data. We implement our framework using several efficient algorithms, and apply them on a range of semi-synthetic experiments.  ( 2 min )
    GAN-based Tabular Data Generator for Constructing Synopsis in Approximate Query Processing: Challenges and Solutions. (arXiv:2212.09015v2 [cs.DB] UPDATED)
    In data-driven systems, data exploration is imperative for making real-time decisions. However, big data is stored in massive databases that are difficult to retrieve. Approximate Query Processing (AQP) is a technique for providing approximate answers to aggregate queries based on a summary of the data (synopsis) that closely replicates the behavior of the actual data, which can be useful where an approximate answer to the queries would be acceptable in a fraction of the real execution time. This study explores the novel utilization of Generative Adversarial Networks (GANs) in the generation of tabular data that can be employed in AQP for synopsis construction. We thoroughly investigate the unique challenges posed by the synopsis construction process, including maintaining data distribution characteristics, handling bounded continuous and categorical data, and preserving semantic relationships and then introduce the advancement of tabular GAN architectures that overcome these challenges. Furthermore, we propose and validate a suite of statistical metrics tailored for assessing the reliability of the GAN-generated synopses. Our findings demonstrate that advanced GAN variations exhibit a promising capacity to generate high-fidelity synopses, potentially transforming the efficiency and effectiveness of AQP in data-driven systems.  ( 2 min )
    Antenna Response Consistency Driven Self-supervised Learning for WIFI-based Human Activity Recognition. (arXiv:2310.06328v2 [cs.LG] UPDATED)
    Self-supervised learning (SSL) for WiFi-based human activity recognition (HAR) holds great promise due to its ability to address the challenge of insufficient labeled data. However, directly transplanting SSL algorithms, especially contrastive learning, originally designed for other domains to CSI data, often fails to achieve the expected performance. We attribute this issue to the inappropriate alignment criteria, which disrupt the semantic distance consistency between the feature space and the input space. To address this challenge, we introduce \textbf{A}ntenna \textbf{R}esponse \textbf{C}onsistency (ARC) as a solution to define proper alignment criteria. ARC is designed to retain semantic information from the input space while introducing robustness to real-world noise. Moreover, we substantiate the effectiveness of ARC through a comprehensive set of experiments, demonstrating its capability to enhance the performance of self-supervised learning for WiFi-based HAR by achieving an increase of over 5\% in accuracy in most cases and achieving a best accuracy of 94.97\%.  ( 2 min )
    A Survey of Federated Unlearning: A Taxonomy, Challenges and Future Directions. (arXiv:2310.19218v2 [cs.LG] UPDATED)
    With the development of trustworthy Federated Learning (FL), the requirement of implementing right to be forgotten gives rise to the area of Federated Unlearning (FU). Comparing to machine unlearning, a major challenge of FU lies in the decentralized and privacy-preserving nature of FL, in which clients jointly train a global model without sharing their raw data, making it substantially more intricate to selectively unlearn specific information. In that regard, many efforts have been made to tackle the challenges of FU and have achieved significant progress. In this paper, we present a comprehensive survey of FU. Specially, we provide the existing algorithms, objectives, evaluation metrics, and identify some challenges of FU. By reviewing and comparing some studies, we summarize them into a taxonomy for various schemes, potential applications and future directions.  ( 2 min )
    Information-theoretic generalization bounds for learning from quantum data. (arXiv:2311.05529v1 [quant-ph])
    Learning tasks play an increasingly prominent role in quantum information and computation. They range from fundamental problems such as state discrimination and metrology over the framework of quantum probably approximately correct (PAC) learning, to the recently proposed shadow variants of state tomography. However, the many directions of quantum learning theory have so far evolved separately. We propose a general mathematical formalism for describing quantum learning by training on classical-quantum data and then testing how well the learned hypothesis generalizes to new data. In this framework, we prove bounds on the expected generalization error of a quantum learner in terms of classical and quantum information-theoretic quantities measuring how strongly the learner's hypothesis depends on the specific data seen during training. To achieve this, we use tools from quantum optimal transport and quantum concentration inequalities to establish non-commutative versions of decoupling lemmas that underlie recent information-theoretic generalization bounds for classical machine learning. Our framework encompasses and gives intuitively accessible generalization bounds for a variety of quantum learning scenarios such as quantum state discrimination, PAC learning quantum states, quantum parameter estimation, and quantumly PAC learning classical functions. Thereby, our work lays a foundation for a unifying quantum information-theoretic perspective on quantum learning.  ( 2 min )
    Dirichlet Active Learning. (arXiv:2311.05501v1 [stat.ML])
    This work introduces Dirichlet Active Learning (DiAL), a Bayesian-inspired approach to the design of active learning algorithms. Our framework models feature-conditional class probabilities as a Dirichlet random field and lends observational strength between similar features in order to calibrate the random field. This random field can then be utilized in learning tasks: in particular, we can use current estimates of mean and variance to conduct classification and active learning in the context where labeled data is scarce. We demonstrate the applicability of this model to low-label rate graph learning by constructing ``propagation operators'' based upon the graph Laplacian, and offer computational studies demonstrating the method's competitiveness with the state of the art. Finally, we provide rigorous guarantees regarding the ability of this approach to ensure both exploration and exploitation, expressed respectively in terms of cluster exploration and increased attention to decision boundaries.  ( 2 min )
    Transfer learning from a sparsely annotated dataset of 3D medical images. (arXiv:2311.05032v1 [eess.IV])
    Transfer learning leverages pre-trained model features from a large dataset to save time and resources when training new models for various tasks, potentially enhancing performance. Due to the lack of large datasets in the medical imaging domain, transfer learning from one medical imaging model to other medical imaging models has not been widely explored. This study explores the use of transfer learning to improve the performance of deep convolutional neural networks for organ segmentation in medical imaging. A base segmentation model (3D U-Net) was trained on a large and sparsely annotated dataset; its weights were used for transfer learning on four new down-stream segmentation tasks for which a fully annotated dataset was available. We analyzed the training set size's influence to simulate scarce data. The results showed that transfer learning from the base model was beneficial when small datasets were available, providing significant performance improvements; where fine-tuning the base model is more beneficial than updating all the network weights with vanilla transfer learning. Transfer learning with fine-tuning increased the performance by up to 0.129 (+28\%) Dice score than experiments trained from scratch, and on average 23 experiments increased the performance by 0.029 Dice score in the new segmentation tasks. The study also showed that cross-modality transfer learning using CT scans was beneficial. The findings of this study demonstrate the potential of transfer learning to improve the efficiency of annotation and increase the accessibility of accurate organ segmentation in medical imaging, ultimately leading to improved patient care. We made the network definition and weights publicly available to benefit other users and researchers.
    Familiarity-Based Open-Set Recognition Under Adversarial Attacks. (arXiv:2311.05006v1 [cs.CV])
    Open-set recognition (OSR), the identification of novel categories, can be a critical component when deploying classification models in real-world applications. Recent work has shown that familiarity-based scoring rules such as the Maximum Softmax Probability (MSP) or the Maximum Logit Score (MLS) are strong baselines when the closed-set accuracy is high. However, one of the potential weaknesses of familiarity-based OSR are adversarial attacks. Here, we present gradient-based adversarial attacks on familiarity scores for both types of attacks, False Familiarity and False Novelty attacks, and evaluate their effectiveness in informed and uninformed settings on TinyImageNet.  ( 2 min )
    Disease Gene Prioritization With Quantum Walks. (arXiv:2311.05486v1 [quant-ph])
    Disease gene prioritization assigns scores to genes or proteins according to their likely relevance for a given disease based on a provided set of seed genes. Here, we describe a new algorithm for disease gene prioritization based on continuous-time quantum walks using the adjacency matrix of a protein-protein interaction (PPI) network. Our algorithm can be seen as a quantum version of a previous method known as the diffusion kernel, but, importantly, has higher performance in predicting disease genes, and also permits the encoding of seed node self-loops into the underlying Hamiltonian, which offers yet another boost in performance. We demonstrate the success of our proposed method by comparing it to several well-known gene prioritization methods on three disease sets, across seven different PPI networks. In order to compare these methods, we use cross-validation and examine the mean reciprocal ranks and recall values. We further validate our method by performing an enrichment analysis of the predicted genes for coronary artery disease. We also investigate the impact of adding self-loops to the seeds, and argue that they allow the quantum walker to remain more local to low-degree seed nodes.  ( 2 min )
    Covering Number of Real Algebraic Varieties: Improved Bound and Applications. (arXiv:2311.05116v1 [math.AG])
    We prove an upper bound on the covering number of real algebraic varieties, images of polynomial maps and semialgebraic sets. The bound remarkably improves the best known bound by Yomdin-Comte, and its proof is much more straightforward. As a consequence, our result gives a bound on volume of the tubular neighborhood of a real variety, improving the results by Lotz and Basu-Lerario. We apply our theory to three main application domains. Firstly, we derive a near-optimal bound on the covering number of low rank CP tensors. Secondly, we prove a bound on the sketching dimension for (general) polynomial optimization problems. Lastly, we deduce generalization error bounds for deep neural networks with rational or ReLU activations, improving or matching the best known results in the literature.  ( 2 min )
    Automated Annotation of Scientific Texts for ML-based Keyphrase Extraction and Validation. (arXiv:2311.05042v1 [cs.IR])
    Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lacks the essential metadata required for researchers to find and search them effectively. The lack of metadata poses a significant challenge in the utilization of these datasets. Machine learning-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific datasets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming; thus, there is an need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining datasets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information about the unlabeled texts and the scientific domain. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches for ML-generated metadata validation. Our results show that the proposed label assignment approaches can generate both generic and highly-specific text labels for the unlabeled texts, with up to 44% of the labels matching with those suggested by a ML keyword extraction algorithm.
    Mental Health Diagnosis in the Digital Age: Harnessing Sentiment Analysis on Social Media Platforms upon Ultra-Sparse Feature Content. (arXiv:2311.05075v1 [cs.LG])
    Amid growing global mental health concerns, particularly among vulnerable groups, natural language processing offers a tremendous potential for early detection and intervention of people's mental disorders via analyzing their postings and discussions on social media platforms. However, ultra-sparse training data, often due to vast vocabularies and low-frequency words, hinders the analysis accuracy. Multi-labeling and Co-occurrences of symptoms may also blur the boundaries in distinguishing similar/co-related disorders. To address these issues, we propose a novel semantic feature preprocessing technique with a three-folded structure: 1) mitigating the feature sparsity with a weak classifier, 2) adaptive feature dimension with modulus loops, and 3) deep-mining and extending features among the contexts. With enhanced semantic features, we train a machine learning model to predict and classify mental disorders. We utilize the Reddit Mental Health Dataset 2022 to examine conditions such as Anxiety, Borderline Personality Disorder (BPD), and Bipolar-Disorder (BD) and present solutions to the data sparsity challenge, highlighted by 99.81% non-zero elements. After applying our preprocessing technique, the feature sparsity decreases to 85.4%. Overall, our methods, when compared to seven benchmark models, demonstrate significant performance improvements: 8.0% in accuracy, 0.069 in precision, 0.093 in recall, 0.102 in F1 score, and 0.059 in AUC. This research provides foundational insights for mental health prediction and monitoring, providing innovative solutions to navigate challenges associated with ultra-sparse data feature and intricate multi-label classification in the domain of mental health analysis.  ( 3 min )
    RAPID: Training-free Retrieval-based Log Anomaly Detection with PLM considering Token-level information. (arXiv:2311.05160v1 [cs.LG])
    As the IT industry advances, system log data becomes increasingly crucial. Many computer systems rely on log texts for management due to restricted access to source code. The need for log anomaly detection is growing, especially in real-world applications, but identifying anomalies in rapidly accumulating logs remains a challenging task. Traditional deep learning-based anomaly detection models require dataset-specific training, leading to corresponding delays. Notably, most methods only focus on sequence-level log information, which makes the detection of subtle anomalies harder, and often involve inference processes that are difficult to utilize in real-time. We introduce RAPID, a model that capitalizes on the inherent features of log data to enable anomaly detection without training delays, ensuring real-time capability. RAPID treats logs as natural language, extracting representations using pre-trained language models. Given that logs can be categorized based on system context, we implement a retrieval-based technique to contrast test logs with the most similar normal logs. This strategy not only obviates the need for log-specific training but also adeptly incorporates token-level information, ensuring refined and robust detection, particularly for unseen logs. We also propose the core set technique, which can reduce the computational cost needed for comparison. Experimental results show that even without training on log data, RAPID demonstrates competitive performance compared to prior models and achieves the best performance on certain datasets. Through various research questions, we verified its capability for real-time detection without delay.  ( 2 min )
    Improving Computational Efficiency for Powered Descent Guidance via Transformer-based Tight Constraint Prediction. (arXiv:2311.05135v1 [math.OC])
    In this work, we present Transformer-based Powered Descent Guidance (T-PDG), a scalable algorithm for reducing the computational complexity of the direct optimization formulation of the spacecraft powered descent guidance problem. T-PDG uses data from prior runs of trajectory optimization algorithms to train a transformer neural network, which accurately predicts the relationship between problem parameters and the globally optimal solution for the powered descent guidance problem. The solution is encoded as the set of tight constraints corresponding to the constrained minimum-cost trajectory and the optimal final time of landing. By leveraging the attention mechanism of transformer neural networks, large sequences of time series data can be accurately predicted when given only the spacecraft state and landing site parameters. When applied to the real problem of Mars powered descent guidance, T-PDG reduces the time for computing the 3 degree of freedom fuel-optimal trajectory, when compared to lossless convexification, from an order of 1-8 seconds to less than 500 milliseconds. A safe and optimal solution is guaranteed by including a feasibility check in T-PDG before returning the final trajectory.  ( 2 min )
    Generalized test utilities for long-tail performance in extreme multi-label classification. (arXiv:2311.05081v1 [cs.LG])
    Extreme multi-label classification (XMLC) is the task of selecting a small subset of relevant labels from a very large set of possible labels. As such, it is characterized by long-tail labels, i.e., most labels have very few positive instances. With standard performance measures such as precision@k, a classifier can ignore tail labels and still report good performance. However, it is often argued that correct predictions in the tail are more interesting or rewarding, but the community has not yet settled on a metric capturing this intuitive concept. The existing propensity-scored metrics fall short on this goal by confounding the problems of long-tail and missing labels. In this paper, we analyze generalized metrics budgeted "at k" as an alternative solution. To tackle the challenging problem of optimizing these metrics, we formulate it in the expected test utility (ETU) framework, which aims at optimizing the expected performance on a fixed test set. We derive optimal prediction rules and construct computationally efficient approximations with provable regret guarantees and robustness against model misspecification. Our algorithm, based on block coordinate ascent, scales effortlessly to XMLC problems and obtains promising results in terms of long-tail performance.  ( 2 min )
    Efficient Compression of Overparameterized Deep Models through Low-Dimensional Learning Dynamics. (arXiv:2311.05061v1 [cs.LG])
    Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, which in turn requires extensive resources to train. In this work, we aim to reduce this complexity by studying the learning dynamics of overparameterized deep networks. By extensively studying its learning dynamics, we unveil that the weight matrices of various architectures exhibit a low-dimensional structure. This finding implies that we can compress the networks by reducing the training to a small subspace. We take a step in developing a principled approach for compressing deep networks by studying deep linear models. We demonstrate that the principal components of deep linear models are fitted incrementally but within a small subspace, and use these insights to compress deep linear networks by decreasing the width of its intermediate layers. Remarkably, we observe that with a particular choice of initialization, the compressed network converges faster than the original network, consistently yielding smaller recovery errors throughout all iterations of gradient descent. We substantiate this observation by developing a theory focused on the deep matrix factorization problem, and by conducting empirical evaluations on deep matrix sensing. Finally, we demonstrate how our compressed model can enhance the utility of deep nonlinear models. Overall, we observe that our compression technique accelerates the training process by more than 2x, without compromising model quality.  ( 3 min )
    Counter-Empirical Attacking based on Adversarial Reinforcement Learning for Time-Relevant Scoring System. (arXiv:2311.05144v1 [cs.LG])
    Scoring systems are commonly seen for platforms in the era of big data. From credit scoring systems in financial services to membership scores in E-commerce shopping platforms, platform managers use such systems to guide users towards the encouraged activity pattern, and manage resources more effectively and more efficiently thereby. To establish such scoring systems, several "empirical criteria" are firstly determined, followed by dedicated top-down design for each factor of the score, which usually requires enormous effort to adjust and tune the scoring function in the new application scenario. What's worse, many fresh projects usually have no ground-truth or any experience to evaluate a reasonable scoring system, making the designing even harder. To reduce the effort of manual adjustment of the scoring function in every new scoring system, we innovatively study the scoring system from the preset empirical criteria without any ground truth, and propose a novel framework to improve the system from scratch. In this paper, we propose a "counter-empirical attacking" mechanism that can generate "attacking" behavior traces and try to break the empirical rules of the scoring system. Then an adversarial "enhancer" is applied to evaluate the scoring system and find the improvement strategy. By training the adversarial learning problem, a proper scoring function can be learned to be robust to the attacking activity traces that are trying to violate the empirical criteria. Extensive experiments have been conducted on two scoring systems including a shared computing resource platform and a financial credit system. The experimental results have validated the effectiveness of our proposed framework.  ( 3 min )
    Don't Waste a Single Annotation: Improving Single-Label Classifiers Through Soft Labels. (arXiv:2311.05265v1 [cs.CL])
    In this paper, we address the limitations of the common data annotation and training methods for objective single-label classification tasks. Typically, when annotating such tasks annotators are only asked to provide a single label for each sample and annotator disagreement is discarded when a final hard label is decided through majority voting. We challenge this traditional approach, acknowledging that determining the appropriate label can be difficult due to the ambiguity and lack of context in the data samples. Rather than discarding the information from such ambiguous annotations, our soft label method makes use of them for training. Our findings indicate that additional annotator information, such as confidence, secondary label and disagreement, can be used to effectively generate soft labels. Training classifiers with these soft labels then leads to improved performance and calibration on the hard label test set.  ( 2 min )
    Detecting Relevant Information in High-Volume Chat Logs: Keyphrase Extraction for Grooming and Drug Dealing Forensic Analysis. (arXiv:2311.04905v1 [cs.CL])
    The growing use of digital communication platforms has given rise to various criminal activities, such as grooming and drug dealing, which pose significant challenges to law enforcement and forensic experts. This paper presents a supervised keyphrase extraction approach to detect relevant information in high-volume chat logs involving grooming and drug dealing for forensic analysis. The proposed method, JointKPE++, builds upon the JointKPE keyphrase extractor by employing improvements to handle longer texts effectively. We evaluate JointKPE++ using BERT-based pre-trained models on grooming and drug dealing datasets, including BERT, RoBERTa, SpanBERT, and BERTimbau. The results show significant improvements over traditional approaches and demonstrate the potential for JointKPE++ to aid forensic experts in efficiently detecting keyphrases related to criminal activities.  ( 2 min )
  • Open

    Sparse PCA Beyond Covariance Thresholding. (arXiv:2302.10158v2 [cs.LG] UPDATED)
    In the Wishart model for sparse PCA we are given $n$ samples $Y_1,\ldots, Y_n$ drawn independently from a $d$-dimensional Gaussian distribution $N({0, Id + \beta vv^\top})$, where $\beta > 0$ and $v\in \mathbb{R}^d$ is a $k$-sparse unit vector, and we wish to recover $v$ (up to sign). We show that if $n \ge \Omega(d)$, then for every $t \ll k$ there exists an algorithm running in time $n\cdot d^{O(t)}$ that solves this problem as long as \[ \beta \gtrsim \frac{k}{\sqrt{nt}}\sqrt{\ln({2 + td/k^2})}\,. \] Prior to this work, the best polynomial time algorithm in the regime $k\approx \sqrt{d}$, called \emph{Covariance Thresholding} (proposed in [KNV15a] and analyzed in [DM14]), required $\beta \gtrsim \frac{k}{\sqrt{n}}\sqrt{\ln({2 + d/k^2})}$. For large enough constant $t$ our algorithm runs in polynomial time and has better guarantees than Covariance Thresholding. Previously known algorithms with such guarantees required quasi-polynomial time $d^{O(\log d)}$. In addition, we show that our techniques work with sparse PCA with adversarial perturbations studied in [dKNS20]. This model generalizes not only sparse PCA, but also other problems studied in prior works, including the sparse planted vector problem. As a consequence, we provide polynomial time algorithms for the sparse planted vector problem that have better guarantees than the state of the art in some regimes. Our approach also works with the Wigner model for sparse PCA. Moreover, we show that it is possible to combine our techniques with recent results on sparse PCA with symmetric heavy-tailed noise [dNNS22]. In particular, in the regime $k \approx \sqrt{d}$ we get the first polynomial time algorithm that works with symmetric heavy-tailed noise, while the algorithm from [dNNS22]. requires quasi-polynomial time in these settings.  ( 3 min )
    Bayesian sequential design of computer experiments for quantile set inversion. (arXiv:2211.01008v3 [stat.ML] UPDATED)
    We consider an unknown multivariate function representing a system-such as a complex numerical simulator-taking both deterministic and uncertain inputs. Our objective is to estimate the set of deterministic inputs leading to outputs whose probability (with respect to the distribution of the uncertain inputs) of belonging to a given set is less than a given threshold. This problem, which we call Quantile Set Inversion (QSI), occurs for instance in the context of robust (reliability-based) optimization problems, when looking for the set of solutions that satisfy the constraints with sufficiently large probability. To solve the QSI problem, we propose a Bayesian strategy based on Gaussian process modeling and the Stepwise Uncertainty Reduction (SUR) principle, to sequentially choose the points at which the function should be evaluated to efficiently approximate the set of interest. We illustrate the performance and interest of the proposed SUR strategy through several numerical experiments.  ( 2 min )
    Counterfactually Comparing Abstaining Classifiers. (arXiv:2305.10564v2 [stat.ML] UPDATED)
    Abstaining classifiers have the option to abstain from making predictions on inputs that they are unsure about. These classifiers are becoming increasingly popular in high-stakes decision-making problems, as they can withhold uncertain predictions to improve their reliability and safety. When evaluating black-box abstaining classifier(s), however, we lack a principled approach that accounts for what the classifier would have predicted on its abstentions. These missing predictions matter when they can eventually be utilized, either directly or as a backup option in a failure mode. In this paper, we introduce a novel approach and perspective to the problem of evaluating and comparing abstaining classifiers by treating abstentions as missing data. Our evaluation approach is centered around defining the counterfactual score of an abstaining classifier, defined as the expected performance of the classifier had it not been allowed to abstain. We specify the conditions under which the counterfactual score is identifiable: if the abstentions are stochastic, and if the evaluation data is independent of the training data (ensuring that the predictions are missing at random), then the score is identifiable. Note that, if abstentions are deterministic, then the score is unidentifiable because the classifier can perform arbitrarily poorly on its abstentions. Leveraging tools from observational causal inference, we then develop nonparametric and doubly robust methods to efficiently estimate this quantity under identification. Our approach is examined in both simulated and real data experiments.  ( 3 min )
    Bayesian Prognostic Covariate Adjustment With Additive Mixture Priors. (arXiv:2310.18027v2 [stat.ME] UPDATED)
    Effective and rapid decision-making from randomized controlled trials (RCTs) requires unbiased and precise treatment effect inferences. Two strategies to address this requirement are to adjust for covariates that are highly correlated with the outcome, and to leverage historical control information via Bayes' theorem. We propose a new Bayesian prognostic covariate adjustment methodology, referred to as Bayesian PROCOVA, that combines these two strategies. Covariate adjustment in Bayesian PROCOVA is based on generative artificial intelligence (AI) algorithms that construct a digital twin generator (DTG) for RCT participants. The DTG is trained on historical control data and yields a digital twin (DT) probability distribution for each RCT participant's outcome under the control treatment. The expectation of the DT distribution, referred to as the prognostic score, defines the covariate for adjustment. Historical control information is leveraged via an additive mixture prior with two components: an informative prior probability distribution specified based on historical control data, and a weakly informative prior distribution. The mixture weight determines the extent to which posterior inferences are drawn from the informative component, versus the weakly informative component. This weight has a prior distribution as well, and so the entire additive mixture prior is completely pre-specifiable without involving any RCT information. We establish an efficient Gibbs algorithm for sampling from the posterior distribution, and derive closed-form expressions for the posterior mean and variance of the treatment effect parameter conditional on the weight, in Bayesian PROCOVA. We evaluate efficiency gains of Bayesian PROCOVA via its bias control and variance reduction compared to frequentist PROCOVA in simulation studies that encompass different discrepancies. These gains translate to smaller RCTs.  ( 3 min )
    Simple steps are all you need: Frank-Wolfe and generalized self-concordant functions. (arXiv:2105.13913v7 [math.OC] UPDATED)
    Generalized self-concordance is a key property present in the objective function of many important learning problems. We establish the convergence rate of a simple Frank-Wolfe variant that uses the open-loop step size strategy $\gamma_t = 2/(t+2)$, obtaining a $\mathcal{O}(1/t)$ convergence rate for this class of functions in terms of primal gap and Frank-Wolfe gap, where $t$ is the iteration count. This avoids the use of second-order information or the need to estimate local smoothness parameters of previous work. We also show improved convergence rates for various common cases, e.g., when the feasible region under consideration is uniformly convex or polyhedral.  ( 2 min )
    Estimation of forest height and biomass from open-access multi-sensor satellite imagery and GEDI Lidar data: high-resolution maps of metropolitan France. (arXiv:2310.14662v2 [stat.ML] UPDATED)
    Mapping forest resources and carbon is important for improving forest management and meeting the objectives of storing carbon and preserving the environment. Spaceborne remote sensing approaches have considerable potential to support forest height monitoring by providing repeated observations at high spatial resolution over large areas. This study uses a machine learning approach that was previously developed to produce local maps of forest parameters (basal area, height, diameter, etc.). The aim of this paper is to present the extension of the approach to much larger scales such as the French national coverage. We used the GEDI Lidar mission as reference height data, and the satellite images from Sentinel-1, Sentinel-2 and ALOS-2 PALSA-2 to estimate forest height and produce a map of France for the year 2020. The height map is then derived into volume and aboveground biomass (AGB) using allometric equations. The validation of the height map with local maps from ALS data shows an accuracy close to the state of the art, with a mean absolute error (MAE) of 4.3 m. Validation on inventory plots representative of French forests shows an MAE of 3.7 m for the height. Estimates are slightly better for coniferous than for broadleaved forests. Volume and AGB maps derived from height shows MAEs of 75 tons/ha and 93 m${}^3$/ha respectively. The results aggregated by sylvo-ecoregion and forest types (owner and species) are further improved, with MAEs of 23 tons/ha and 30 m${}^3$/ha. The precision of these maps allows to monitor forests locally, as well as helping to analyze forest resources and carbon on a territorial scale or on specific types of forests by combining the maps with geolocated information (administrative area, species, type of owner, protected areas, environmental conditions, etc.). Height, volume and AGB maps produced in this study are made freely available.  ( 3 min )
    Meta-learning of semi-supervised learning from tasks with heterogeneous attribute spaces. (arXiv:2311.05088v1 [cs.LG])
    We propose a meta-learning method for semi-supervised learning that learns from multiple tasks with heterogeneous attribute spaces. The existing semi-supervised meta-learning methods assume that all tasks share the same attribute space, which prevents us from learning with a wide variety of tasks. With the proposed method, the expected test performance on tasks with a small amount of labeled data is improved with unlabeled data as well as data in various tasks, where the attribute spaces are different among tasks. The proposed method embeds labeled and unlabeled data simultaneously in a task-specific space using a neural network, and the unlabeled data's labels are estimated by adapting classification or regression models in the embedding space. For the neural network, we develop variable-feature self-attention layers, which enable us to find embeddings of data with different attribute spaces with a single neural network by considering interactions among examples, attributes, and labels. Our experiments on classification and regression datasets with heterogeneous attribute spaces demonstrate that our proposed method outperforms the existing meta-learning and semi-supervised learning methods.  ( 2 min )
    Efficient Compression of Overparameterized Deep Models through Low-Dimensional Learning Dynamics. (arXiv:2311.05061v1 [cs.LG])
    Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, which in turn requires extensive resources to train. In this work, we aim to reduce this complexity by studying the learning dynamics of overparameterized deep networks. By extensively studying its learning dynamics, we unveil that the weight matrices of various architectures exhibit a low-dimensional structure. This finding implies that we can compress the networks by reducing the training to a small subspace. We take a step in developing a principled approach for compressing deep networks by studying deep linear models. We demonstrate that the principal components of deep linear models are fitted incrementally but within a small subspace, and use these insights to compress deep linear networks by decreasing the width of its intermediate layers. Remarkably, we observe that with a particular choice of initialization, the compressed network converges faster than the original network, consistently yielding smaller recovery errors throughout all iterations of gradient descent. We substantiate this observation by developing a theory focused on the deep matrix factorization problem, and by conducting empirical evaluations on deep matrix sensing. Finally, we demonstrate how our compressed model can enhance the utility of deep nonlinear models. Overall, we observe that our compression technique accelerates the training process by more than 2x, without compromising model quality.  ( 3 min )
    Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees. (arXiv:2305.01588v2 [cs.LG] UPDATED)
    Gradient clipping is a popular modification to standard (stochastic) gradient descent, at every iteration limiting the gradient norm to a certain value $c >0$. It is widely used for example for stabilizing the training of deep learning models (Goodfellow et al., 2016), or for enforcing differential privacy (Abadi et al., 2016). Despite popularity and simplicity of the clipping mechanism, its convergence guarantees often require specific values of $c$ and strong noise assumptions. In this paper, we give convergence guarantees that show precise dependence on arbitrary clipping thresholds $c$ and show that our guarantees are tight with both deterministic and stochastic gradients. In particular, we show that (i) for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence, (ii) in the stochastic setting convergence to the true optimum cannot be guaranteed under the standard noise assumption, even under arbitrary small step-sizes. We give matching upper and lower bounds for convergence of the gradient norm when running clipped SGD, and illustrate these results with experiments.  ( 2 min )
    Implicit Variational Inference for High-Dimensional Posteriors. (arXiv:2310.06643v3 [cs.LG] UPDATED)
    In variational inference, the benefits of Bayesian models rely on accurately capturing the true posterior distribution. We propose using neural samplers that specify implicit distributions, which are well-suited for approximating complex multimodal and correlated posteriors in high-dimensional spaces. Our approach introduces novel bounds for approximate inference using implicit distributions by locally linearising the neural sampler. This is distinct from existing methods that rely on additional discriminator networks and unstable adversarial objectives. Furthermore, we present a new sampler architecture that, for the first time, enables implicit distributions over tens of millions of latent variables, addressing computational concerns by using differentiable numerical approximations. We empirically show that our method is capable of recovering correlations across layers in large Bayesian neural networks, a property that is crucial for a network's performance but notoriously challenging to achieve. To the best of our knowledge, no other method has been shown to accomplish this task for such large models. Through experiments in downstream tasks, we demonstrate that our expressive posteriors outperform state-of-the-art uncertainty quantification methods, validating the effectiveness of our training algorithm and the quality of the learned implicit approximation.  ( 2 min )
    Identifiability Guarantees for Causal Disentanglement from Soft Interventions. (arXiv:2307.06250v3 [stat.ML] UPDATED)
    Causal disentanglement aims to uncover a representation of data using latent variables that are interrelated through a causal model. Such a representation is identifiable if the latent model that explains the data is unique. In this paper, we focus on the scenario where unpaired observational and interventional data are available, with each intervention changing the mechanism of a latent variable. When the causal variables are fully observed, statistically consistent algorithms have been developed to identify the causal model under faithfulness assumptions. We here show that identifiability can still be achieved with unobserved causal variables, given a generalized notion of faithfulness. Our results guarantee that we can recover the latent causal model up to an equivalence class and predict the effect of unseen combinations of interventions, in the limit of infinite data. We implement our causal disentanglement framework by developing an autoencoding variational Bayes algorithm and apply it to the problem of predicting combinatorial perturbation effects in genomics.  ( 2 min )
    A posteriori Trading-inspired Model-free Time Series Segmentation. (arXiv:1912.06708v2 [stat.ML] UPDATED)
    Within the context of multivariate time series segmentation this paper proposes a method inspired by a posteriori optimal trading. After a normalization step time series are treated channel-wise as surrogate stock prices that can be traded optimally a posteriori in a virtual portfolio holding either stock or cash. Linear transaction costs are interpreted as hyperparameters for noise filtering. Resulting trading signals as well as resulting trading signals obtained on the reversed time series are used for unsupervised labeling, before a consensus over channels is reached that determines segmentation time instants. The method is model-free such that no model prescriptions for segments are made. Benefits of proposed approach include simplicity, adaptability to a wide range of different shapes of time series, and in particular computational efficiency that make it suitable for big data. Performance is demonstrated on synthetic and real-world data, including a large-scale dataset comprising a multivariate time series of dimension 1000 and length 2709. Proposed method is compared to a popular model-based bottom-up approach fitting piecewise affine models and to a state-of-the-art model-based top-down approach fitting Gaussian models, and found to be consistently faster while producing more intuitive results.  ( 2 min )
    Optimization on Manifolds via Graph Gaussian Processes. (arXiv:2210.10962v3 [stat.ML] UPDATED)
    This paper integrates manifold learning techniques within a \emph{Gaussian process upper confidence bound} algorithm to optimize an objective function on a manifold. Our approach is motivated by applications where a full representation of the manifold is not available and querying the objective is expensive. We rely on a point cloud of manifold samples to define a graph Gaussian process surrogate model for the objective. Query points are sequentially chosen using the posterior distribution of the surrogate model given all previous queries. We establish regret bounds in terms of the number of queries and the size of the point cloud. Several numerical examples complement the theory and illustrate the performance of our method.  ( 2 min )
    Prediction-Powered Inference. (arXiv:2301.09633v4 [stat.ML] UPDATED)
    Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system. The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients, without making any assumptions on the machine-learning algorithm that supplies the predictions. Furthermore, more accurate predictions translate to smaller confidence intervals. Prediction-powered inference could enable researchers to draw valid and more data-efficient conclusions using machine learning. The benefits of prediction-powered inference are demonstrated with datasets from proteomics, astronomy, genomics, remote sensing, census analysis, and ecology.  ( 2 min )
    When Meta-Learning Meets Online and Continual Learning: A Survey. (arXiv:2311.05241v1 [cs.LG])
    Over the past decade, deep neural networks have demonstrated significant success using the training scheme that involves mini-batch stochastic gradient descent on extensive datasets. Expanding upon this accomplishment, there has been a surge in research exploring the application of neural networks in other learning scenarios. One notable framework that has garnered significant attention is meta-learning. Often described as "learning to learn," meta-learning is a data-driven approach to optimize the learning algorithm. Other branches of interest are continual learning and online learning, both of which involve incrementally updating a model with streaming data. While these frameworks were initially developed independently, recent works have started investigating their combinations, proposing novel problem settings and learning algorithms. However, due to the elevated complexity and lack of unified terminology, discerning differences between the learning frameworks can be challenging even for experienced researchers. To facilitate a clear understanding, this paper provides a comprehensive survey that organizes various problem settings using consistent terminology and formal descriptions. By offering an overview of these learning paradigms, our work aims to foster further advancements in this promising area of research.  ( 2 min )
    The Sample Complexity Of ERMs In Stochastic Convex Optimization. (arXiv:2311.05398v1 [cs.LG])
    Stochastic convex optimization is one of the most well-studied models for learning in modern machine learning. Nevertheless, a central fundamental question in this setup remained unresolved: "How many data points must be observed so that any empirical risk minimizer (ERM) shows good performance on the true population?" This question was proposed by Feldman (2016), who proved that $\Omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are necessary (where $d$ is the dimension and $\epsilon>0$ is the accuracy parameter). Proving an $\omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ lower bound was left as an open problem. In this work we show that in fact $\tilde{O}(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are also sufficient. This settles the question and yields a new separation between ERMs and uniform convergence. This sample complexity holds for the classical setup of learning bounded convex Lipschitz functions over the Euclidean unit ball. We further generalize the result and show that a similar upper bound holds for all symmetric convex bodies. The general bound is composed of two terms: (i) a term of the form $\tilde{O}(\frac{d}{\epsilon})$ with an inverse-linear dependence on the accuracy parameter, and (ii) a term that depends on the statistical complexity of the class of $\textit{linear}$ functions (captured by the Rademacher complexity). The proof builds a mechanism for controlling the behavior of stochastic convex optimization problems.  ( 2 min )
    Penalising the biases in norm regularisation enforces sparsity. (arXiv:2303.01353v3 [stat.ML] UPDATED)
    Controlling the parameters' norm often yields good generalisation when training neural networks. Beyond simple intuitions, the relation between regularising parameters' norm and obtained estimators remains theoretically misunderstood. For one hidden ReLU layer networks with unidimensional data, this work shows the parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor. Notably, this weighting factor disappears when the norm of bias terms is not regularised. The presence of this additional weighting factor is of utmost significance as it is shown to enforce the uniqueness and sparsity (in the number of kinks) of the minimal norm interpolator. Conversely, omitting the bias' norm allows for non-sparse solutions. Penalising the bias terms in the regularisation, either explicitly or implicitly, thus leads to sparse estimators.  ( 2 min )
    Outlier-Robust Wasserstein DRO. (arXiv:2311.05573v1 [stat.ML])
    Distributionally robust optimization (DRO) is an effective approach for data-driven decision-making in the presence of uncertainty. Geometric uncertainty due to sampling or localized perturbations of data points is captured by Wasserstein DRO (WDRO), which seeks to learn a model that performs uniformly well over a Wasserstein ball centered around the observed data distribution. However, WDRO fails to account for non-geometric perturbations such as adversarial outliers, which can greatly distort the Wasserstein distance measurement and impede the learned model. We address this gap by proposing a novel outlier-robust WDRO framework for decision-making under both geometric (Wasserstein) perturbations and non-geometric (total variation (TV)) contamination that allows an $\varepsilon$-fraction of data to be arbitrarily corrupted. We design an uncertainty set using a certain robust Wasserstein ball that accounts for both perturbation types and derive minimax optimal excess risk bounds for this procedure that explicitly capture the Wasserstein and TV risks. We prove a strong duality result that enables tractable convex reformulations and efficient computation of our outlier-robust WDRO problem. When the loss function depends only on low-dimensional features of the data, we eliminate certain dimension dependencies from the risk bounds that are unavoidable in the general setting. Finally, we present experiments validating our theory on standard regression and classification tasks.  ( 2 min )
    Dirichlet Active Learning. (arXiv:2311.05501v1 [stat.ML])
    This work introduces Dirichlet Active Learning (DiAL), a Bayesian-inspired approach to the design of active learning algorithms. Our framework models feature-conditional class probabilities as a Dirichlet random field and lends observational strength between similar features in order to calibrate the random field. This random field can then be utilized in learning tasks: in particular, we can use current estimates of mean and variance to conduct classification and active learning in the context where labeled data is scarce. We demonstrate the applicability of this model to low-label rate graph learning by constructing ``propagation operators'' based upon the graph Laplacian, and offer computational studies demonstrating the method's competitiveness with the state of the art. Finally, we provide rigorous guarantees regarding the ability of this approach to ensure both exploration and exploitation, expressed respectively in terms of cluster exploration and increased attention to decision boundaries.  ( 2 min )
    Comparing Sequential Forecasters. (arXiv:2110.00115v6 [stat.ME] UPDATED)
    Consider two forecasters, each making a single prediction for a sequence of events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts and outcomes were generated? In this paper, we present a rigorous answer to this question by designing novel sequential inference procedures for estimating the time-varying difference in forecast scores. To do this, we employ confidence sequences (CS), which are sequences of confidence intervals that can be continuously monitored and are valid at arbitrary data-dependent stopping times ("anytime-valid"). The widths of our CSs are adaptive to the underlying variance of the score differences. Underlying their construction is a game-theoretic statistical framework, in which we further identify e-processes and p-processes for sequentially testing a weak null hypothesis -- whether one forecaster outperforms another on average (rather than always). Our methods do not make distributional assumptions on the forecasts or outcomes; our main theorems apply to any bounded scores, and we later provide alternative methods for unbounded scores. We empirically validate our approaches by comparing real-world baseball and weather forecasters.  ( 3 min )
    Fair Wasserstein Coresets. (arXiv:2311.05436v1 [stat.ML])
    Recent technological advancements have given rise to the ability of collecting vast amounts of data, that often exceed the capacity of commonly used machine learning algorithms. Approaches such as coresets and synthetic data distillation have emerged as frameworks to generate a smaller, yet representative, set of samples for downstream training. As machine learning is increasingly applied to decision-making processes, it becomes imperative for modelers to consider and address biases in the data concerning subgroups defined by factors like race, gender, or other sensitive attributes. Current approaches focus on creating fair synthetic representative samples by optimizing local properties relative to the original samples. These methods, however, are not guaranteed to positively affect the performance or fairness of downstream learning processes. In this work, we present Fair Wasserstein Coresets (FWC), a novel coreset approach which generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. FWC aims to minimize the Wasserstein distance between the original datasets and the weighted synthetic samples while enforcing (an empirical version of) demographic parity, a prominent criterion for algorithmic fairness, via a linear constraint. We show that FWC can be thought of as a constrained version of Lloyd's algorithm for k-medians or k-means clustering. Our experiments, conducted on both synthetic and real datasets, demonstrate the scalability of our approach and highlight the competitive performance of FWC compared to existing fair clustering approaches, even when attempting to enhance the fairness of the latter through fair pre-processing techniques.  ( 2 min )
    Accelerating Exploration with Unlabeled Prior Data. (arXiv:2311.05067v1 [cs.LG])
    Learning to solve tasks from a sparse reward signal is a major challenge for standard reinforcement learning (RL) algorithms. However, in the real world, agents rarely need to solve sparse reward tasks entirely from scratch. More often, we might possess prior experience to draw on that provides considerable guidance about which actions and outcomes are possible in the world, which we can use to explore more effectively for new tasks. In this work, we study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task. We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization. This general formula leads to rapid exploration in several challenging sparse-reward domains where tabula rasa exploration is insufficient, including the AntMaze domain, Adroit hand manipulation domain, and a visual simulated robotic manipulation domain. Our results highlight the ease of incorporating unlabeled prior data into existing online RL algorithms, and the (perhaps surprising) effectiveness of doing so.  ( 2 min )
    A theory of data variability in Neural Network Bayesian inference. (arXiv:2307.16695v2 [cond-mat.dis-nn] UPDATED)
    Bayesian inference and kernel methods are well established in machine learning. The neural network Gaussian process in particular provides a concept to investigate neural networks in the limit of infinitely wide hidden layers by using kernel and inference methods. Here we build upon this limit and provide a field-theoretic formalism which covers the generalization properties of infinitely wide networks. We systematically compute generalization properties of linear, non-linear, and deep non-linear networks for kernel matrices with heterogeneous entries. In contrast to currently employed spectral methods we derive the generalization properties from the statistical properties of the input, elucidating the interplay of input dimensionality, size of the training data set, and variability of the data. We show that data variability leads to a non-Gaussian action reminiscent of a ($\varphi^3+\varphi^4$)-theory. Using our formalism on a synthetic task and on MNIST we obtain a homogeneous kernel matrix approximation for the learning curve as well as corrections due to data variability which allow the estimation of the generalization properties and exact results for the bounds of the learning curves in the case of infinitely many training data points.  ( 2 min )
    A Generalized Variable Importance Metric and Estimator for Black Box Machine Learning Models. (arXiv:2212.09931v2 [stat.CO] UPDATED)
    The aim of this study is to define importance of predictors for black box machine learning methods, where the prediction function can be complex and cannot be represented by statistical parameters. In this paper we defined a ``Generalized Variable Importance Metric (GVIM)'' using the true conditional expectation function for a continuous or a binary response variable. We further showed that the defined GVIM can be represented as a function of the Conditional Average Treatment Effect (CATE) for multinomial and continuous predictors. Then we propose how the metric can be estimated using using any machine learning models. Finally using simulations we evaluated the properties of the estimator when estimated from XGBoost, Random Forest and a mis-specified generalized additive model.  ( 2 min )
    Quantifying Gender Bias Towards Politicians in Cross-Lingual Language Models. (arXiv:2104.07505v2 [cs.CL] UPDATED)
    Recent research has demonstrated that large pre-trained language models reflect societal biases expressed in natural language. The present paper introduces a simple method for probing language models to conduct a multilingual study of gender bias towards politicians. We quantify the usage of adjectives and verbs generated by language models surrounding the names of politicians as a function of their gender. To this end, we curate a dataset of 250k politicians worldwide, including their names and gender. Our study is conducted in seven languages across six different language modeling architectures. The results demonstrate that pre-trained language models' stance towards politicians varies strongly across analyzed languages. We find that while some words such as dead, and designated are associated with both male and female politicians, a few specific words such as beautiful and divorced are predominantly associated with female politicians. Finally, and contrary to previous findings, our study suggests that larger language models do not tend to be significantly more gender-biased than smaller ones.  ( 2 min )
    A Coefficient Makes SVRG Effective. (arXiv:2311.05589v1 [cs.LG])
    Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlights, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our analysis finds that, for deeper networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient $\alpha$ to control the strength and adjust it through a linear decay schedule. We name our method $\alpha$-SVRG. Our results show $\alpha$-SVRG better optimizes neural networks, consistently reducing training loss compared to both baseline and the standard SVRG across various architectures and image classification datasets. We hope our findings encourage further exploration into variance reduction techniques in deep learning. Code is available at https://github.com/davidyyd/alpha-SVRG.  ( 2 min )
    Decentralized SGD and Average-direction SAM are Asymptotically Equivalent. (arXiv:2306.02913v5 [cs.LG] UPDATED)
    Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction Sharpness-aware minimization (SAM) algorithm under general non-convex non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios. The code is available at https://github.com/Raiden-Zhu/ICML-2023-DSGD-and-SAM.  ( 2 min )
    Intervention Generalization: A View from Factor Graph Models. (arXiv:2306.04027v2 [stat.ML] UPDATED)
    One of the goals of causal inference is to generalize from past experiments and observational data to novel conditions. While it is in principle possible to eventually learn a mapping from a novel experimental condition to an outcome of interest, provided a sufficient variety of experiments is available in the training data, coping with a large combinatorial space of possible interventions is hard. Under a typical sparse experimental design, this mapping is ill-posed without relying on heavy regularization or prior distributions. Such assumptions may or may not be reliable, and can be hard to defend or test. In this paper, we take a close look at how to warrant a leap from past experiments to novel conditions based on minimal assumptions about the factorization of the distribution of the manipulated system, communicated in the well-understood language of factor graph models. A postulated $\textit{interventional factor model}$ (IFM) may not always be informative, but it conveniently abstracts away a need for explicitly modeling unmeasured confounding and feedback mechanisms, leading to directly testable claims. Given an IFM and datasets from a collection of experimental regimes, we derive conditions for identifiability of the expected outcomes of new regimes never observed in these training data. We implement our framework using several efficient algorithms, and apply them on a range of semi-synthetic experiments.  ( 2 min )
    Uncertainty Wrapper in the medical domain: Establishing transparent uncertainty quantification for opaque machine learning models in practice. (arXiv:2311.05245v1 [cs.LG])
    When systems use data-based models that are based on machine learning (ML), errors in their results cannot be ruled out. This is particularly critical if it remains unclear to the user how these models arrived at their decisions and if errors can have safety-relevant consequences, as is often the case in the medical field. In such cases, the use of dependable methods to quantify the uncertainty remaining in a result allows the user to make an informed decision about further usage and draw possible conclusions based on a given result. This paper demonstrates the applicability and practical utility of the Uncertainty Wrapper using flow cytometry as an application from the medical field that can benefit from the use of ML models in conjunction with dependable and transparent uncertainty quantification.  ( 2 min )
    On the Consistency of Maximum Likelihood Estimation of Probabilistic Principal Component Analysis. (arXiv:2311.05046v1 [stat.ML])
    Probabilistic principal component analysis (PPCA) is currently one of the most used statistical tools to reduce the ambient dimension of the data. From multidimensional scaling to the imputation of missing data, PPCA has a broad spectrum of applications ranging from science and engineering to quantitative finance. Despite this wide applicability in various fields, hardly any theoretical guarantees exist to justify the soundness of the maximal likelihood (ML) solution for this model. In fact, it is well known that the maximum likelihood estimation (MLE) can only recover the true model parameters up to a rotation. The main obstruction is posed by the inherent identifiability nature of the PPCA model resulting from the rotational symmetry of the parameterization. To resolve this ambiguity, we propose a novel approach using quotient topological spaces and in particular, we show that the maximum likelihood solution is consistent in an appropriate quotient Euclidean space. Furthermore, our consistency results encompass a more general class of estimators beyond the MLE. Strong consistency of the ML estimate and consequently strong covariance estimation of the PPCA model have also been established under a compactness assumption.  ( 2 min )
    Online Kernel CUSUM for Change-Point Detection. (arXiv:2211.15070v5 [stat.ME] UPDATED)
    We present a computationally efficient online kernel Cumulative Sum (CUSUM) method for change-point detection that utilizes the maximum over a set of kernel statistics to account for the unknown change-point location. Our approach exhibits increased sensitivity to small changes compared to existing kernel-based change-point detection methods, including Scan-B statistic, corresponding to a non-parametric Shewhart chart-type procedure. We provide accurate analytic approximations for two key performance metrics: the Average Run Length (ARL) and Expected Detection Delay (EDD), which enable us to establish an optimal window length to be on the order of the logarithm of ARL to ensure minimal power loss relative to an oracle procedure with infinite memory. Moreover, we introduce a recursive calculation procedure for detection statistics to ensure constant computational and memory complexity, which is essential for online implementation. Through extensive experiments on both simulated and real data, we demonstrate the competitive performance of our method and validate our theoretical results.  ( 2 min )
    Consensus-based construction of high-dimensional free energy surface. (arXiv:2311.05009v1 [physics.comp-ph])
    One essential problem in quantifying the collective behaviors of molecular systems lies in the accurate construction of free energy surfaces (FESs). The main challenges arise from the prevalence of energy barriers and the high dimensionality. Existing approaches are often based on sophisticated enhanced sampling methods to establish efficient exploration of the full-phase space. On the other hand, the collection of optimal sample points for the numerical approximation of FESs remains largely under-explored, where the discretization error could become dominant for systems with a large number of collective variables (CVs). We propose a consensus sampling-based approach by reformulating the construction as a minimax problem which simultaneously optimizes the function representation and the training set. In particular, the maximization step establishes a stochastic interacting particle system to achieve the adaptive sampling of the max-residue regime by modulating the exploitation of the Laplace approximation of the current loss function and the exploration of the uncharted phase space; the minimization step updates the FES approximation with the new training set. By iteratively solving the minimax problem, the present method essentially achieves an adversarial learning of the FESs with unified tasks for both phase space exploration and posterior error-enhanced sampling. We demonstrate the method by constructing the FESs of molecular systems with a number of CVs up to 30.  ( 2 min )
    Unbiased Kinetic Langevin Monte Carlo with Inexact Gradients. (arXiv:2311.05025v1 [stat.CO])
    We present an unbiased method for Bayesian posterior means based on kinetic Langevin dynamics that combines advanced splitting methods with enhanced gradient approximations. Our approach avoids Metropolis correction by coupling Markov chains at different discretization levels in a multilevel Monte Carlo approach. Theoretical analysis demonstrates that our proposed estimator is unbiased, attains finite variance, and satisfies a central limit theorem. It can achieve accuracy $\epsilon>0$ for estimating expectations of Lipschitz functions in $d$ dimensions with $\mathcal{O}(d^{1/4}\epsilon^{-2})$ expected gradient evaluations, without assuming warm start. We exhibit similar bounds using both approximate and stochastic gradients, and our method's computational cost is shown to scale logarithmically with the size of the dataset. The proposed method is tested using a multinomial regression problem on the MNIST dataset and a Poisson regression model for soccer scores. Experiments indicate that the number of gradient evaluations per effective sample is independent of dimension, even when using inexact gradients. For product distributions, we give dimension-independent variance bounds. Our results demonstrate that the unbiased algorithm we present can be much more efficient than the ``gold-standard" randomized Hamiltonian Monte Carlo.  ( 2 min )

  • Open

    "Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations", Hong et al 2023 (offline RL: IQL for training LLMs to plan by simulating humans)
    submitted by /u/gwern [link] [comments]
    "ΨPO: A General Theoretical Paradigm to Understand Learning from Human Preferences", Azar et al 2023 {DM}
    submitted by /u/gwern [link] [comments]
    Looking for Tips and Pitfalls to Avoid for Teaching Robotic Movement with RL
    Hey Everyone, I want to teach a 6 legged hexapod robot how to walk from scratch with reinforcement learning. For the last couple of months, I've looked into doing this with a URDF file and Nvidia's IsaacGym, but I've been struggling a lot trying to implement it without errors on their examples and I've moved to just using it as a simulator and wrapping that simulator around a stable baselines gym to start training. That said, I wanted to reach out here and hear from all of you about your success and failure stories with similar problems. What has been the best digital simulator for you? What about models? What hasn't worked for you? Any tips at all would be appreciated! ​ Thanks! submitted by /u/bbri826 [link] [comments]
    Reinforcement learning for signal optimization
    I have maybe an unconventional application. I am using reinforcement learning for the optimization of a signal over time. The environment is a simulation that maps a spectrum to the signal, the reward depnds on the difference to a target spectrum. Now the signal consists of lets say 1000 points. And I want the agent to decide whether to increase or decrease the amplitude of a time point or stay the same. Therefore I use a multidiscrete action space which is a vector with the size 1000 with value 3 for the possible actions. I have set up a gym env with the simulation and now I want to use a DeepNN. However DQN is not suited for multidiscrete spaces. The question: Can you recommend a DeepNN for working with a gym environment and a multidiscrete action space? Thank you very much for any input! submitted by /u/marinetankguy2 [link] [comments]
  • Open

    Best Neural Networks Courses on Udemy to Consider
    submitted by /u/Lakshmireddys [link] [comments]
    Есть ли право у ии использовать чужие логотипы?
    Попросив Bing image creater создать изображения ноутбука с chrome dino - я получил это, ноутбук, на котором есть логотип chrome, и нарушает ли это авторское право Google?🧐 submitted by /u/logic-juggler [link] [comments]
  • Open

    [D] Searching for first AI online course
    TL:DR I'm searching for backup/archive of https://www.ai-class.com/ video lectures Recently while learning about new developments in this space I got a flashback to early 2010s when I took my first online course at https://www.ai-class.com/. I remember very little about course itself, but it remember it sparked my interest in AI. Sadly I can't find original video lectures. It would be very interesting to see how AI education material looked before AlexNet and first MOOCs in general. And of course dose of nostalgia :) I remember that material moved to Udacity and it looks like updated version is this https://learn.udacity.com/courses/cs271, but videos are newer. Also there is archive.org version https://web.archive.org/web/20130415213831/https://www.ai-class.com/ but I can't get video from there. So if anyone has ideas where I could find its backup/archive please share them. submitted by /u/Kablys [link] [comments]  ( 9 min )
    [D] 2024 ICLR Blog Track
    Looks like the ICLR Blog Track https://iclr-blogposts.github.io/2024/about/ is happening again this year. What do you guys think of these alternate-format science publications (i.e. not PDFs)? submitted by /u/DuchessOfNull [link] [comments]  ( 9 min )
    [D] Seeking Guidance on Implementing Self-Distillation with DinoV2: Any Code Resources?
    Hey fellow Redditors, I'm currently exploring self-distillation techniques with DinoV2 and could use some guidance. I'm wondering if anyone has experience or resources related to implementing self-distillation using DinoV2? Specifically, I'm interested in finding code examples or repositories that provide a solid foundation for this task. I've searched online but haven't come across anything comprehensive. If you've worked on a similar project or know of any code repositories that cover self-distillation with DinoV2, could you please share them with me? I'd greatly appreciate it! Additionally, any insights, tips, or best practices you can share on self-distillation with DinoV2 would be incredibly helpful. Thanks in advance for your assistance! submitted by /u/MarioAvolio [link] [comments]  ( 9 min )
    [R] Scalable autoencoder recommender via cheap approximate inverse
    We created a scalable extension of a popular collaborative filtering model EASE, a high-dimensional linear autoencoder. To train EASE, one needs to invert a symmetric positive definite n x n matrix, where n=#items, which is computationally very expensive when the item catalog is large (tens of thousand can be a lot). We observe that when the number of items is very large, we can cheaply find a sparse approximation of the inverse (which would find the dominant entries quite accurately, and ignore the rest). This way, our method extends EASE to million item catalogs. You can find our paper here, and feel free to try our model, the code is open source on GitHub. I’d be happy to discuss it further with anyone interested :) submitted by /u/maspest_masp [link] [comments]  ( 9 min )
    [R] Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer
    Paper: https://arxiv.org/abs/2310.12442 Abstract: Pretrained transformer models have demonstrated remarkable performance across various natural language processing tasks. These models leverage the attention mechanism to capture long- and short-range dependencies in the sequence. However, the (full) attention mechanism incurs high computational cost - quadratic in the sequence length, which is not affordable in tasks with long sequences, e.g., inputs with 8k tokens. Although sparse attention can be used to improve computational efficiency, as suggested in existing work, it has limited modeling capacity and often fails to capture complicated dependencies in long sequences. To tackle this challenge, we propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans. Specifically, MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers. For the remaining layers, MASformer only employs sparse attention to capture short-range dependencies. Our experiments on natural language modeling and generation tasks show that a decoder-only MASFormer model of 1.3B parameters can achieve competitive performance to vanilla transformers with full attention while significantly reducing computational cost (up to 75%). Additionally, we investigate the effectiveness of continual training with long sequence data and how sequence length impacts downstream generation performance, which may be of independent interest. https://preview.redd.it/tvhrce93kkzb1.png?width=995&format=png&auto=webp&s=d866c815f4b2f764e00acf771a175c71a281f9ed https://preview.redd.it/9ytoaf93kkzb1.png?width=1109&format=png&auto=webp&s=2a3ddcf420511bdc9ee6d2df3b77bf5970966bd8 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Grad Proposal Ideas for Nvidia Nemo and AWS Services
    Hello, I am a fourth year student currently in my graduation project, I need assistance in creating a project that's about a ticket based management system that involves NVidia Nemo (NLP) conversational AI and AWS Services. The goal is to provide a AI powered (TBMS) that not only comprehends user queries swiftly and accurately, but also be able to learns from its conversations with the user. This project will be aimed for tele sales companies, as it is an involving industry and in much need of that support. By using Nemo, we can integrate text to speech or vice versa so the interaction with the system is much more convenient for the user and agent. So My question is how can I implement this into real life and if anyone has any more ideas on how I could implement it into the project, that would be greatly appreciated. submitted by /u/Infamous-Aardvark357 [link] [comments]  ( 9 min )
    [D] LLM for quality assessment of journal articles
    I'm looking for suggestions for how I might get a pre-trained LLM (such as GPT-4) to assess the 'quality' of academic journal articles. I appreciate that is a highly subjective task, but one that I need to do nonetheless, and any results would be taken with skepticism. Currently all I can get from GPT4 is for it to summarise the paper, but no quality assessment, and wondered how much the response could be shaped by prompt engineering, for example. I have some limited scoring data that could be used for fine tuning, but would ideally like to use that for testing instead. Any thoughts/suggestions/ideas? submitted by /u/_onemanband_ [link] [comments]  ( 9 min )
    [Project] Open source evaluations for AI Agents in web tasks
    Recently created Banana-lyzer, an open source AI Agent evaluation framework and dataset for web tasks with Playwright (And has a banana theme because why not) and would love to get feedback/support. There are a few issues with existing evals repos: Websites change overtime, are affected by latency, and may have anti bot protections. We need a system that can reliably save and deploy historic/static snapshots of websites. Standard web practices are loose and there is an abundance of different underlying ways to represent a single individual website. For an agent to best generalize, we require building a diverse dataset of websites across industries and use-cases. We have specific evaluation criteria and agent use cases focusing on structured and direct information retrieval across websites. There exists valuable web task datasets and evaluations that we'd like to unify in a single repo (Mind2Web, WebArena, etc). Read more here: https://github.com/reworkd/bananalyzer submitted by /u/asim-shrestha [link] [comments]  ( 9 min )
    [D] ICLR 2024 Paper Reviews
    ICLR 2024 paper reviews are visible on OpenReview. I thought to create a discussion thread for us to discuss any issue/complain/celebration or anything else. There is so much noise in the reviews every year. Some good work that the authors are proud of might get a low score because of the noisy system, given that ICLR is growing so large these years. We should keep in mind that the work is still valuable no matter what the score is. submitted by /u/zy415 [link] [comments]  ( 9 min )
    [D] Is there a limit to the number of papers we can submit to NeurIPS?
    I am asking for a friend. submitted by /u/correlation_hell [link] [comments]  ( 9 min )
    [D] Recommend some software for multi-class text labelling?
    We have on the order of 5000 textual reports (from one sentence to a few paragraphs each) we need to get external expert contractors (who we manage directly) to do multi-class labelling on. There are about 180 labels. It needs to be on-premise (or at least own-cloud, not SaaS) as the data is sensitive. We want the software to manage users and roles. Something like AWS Sagemaker Ground Truth would be perfect but it only supports 50 labels for multi-class labelling tasks. Can anyone recommend something suitable, free or paid? submitted by /u/ratkins [link] [comments]  ( 9 min )
    [D] How large an LLM can I train from scratch on a single A100 GPU with 80Gb memory?
    I have access to a single 80Gb A100 GPU and would like to train an LLM with GPT-like architecture from scratch. Does anyone know how to calculate the maximum model size. submitted by /u/eeeehhh [link] [comments]  ( 9 min )
    [P] I build a therapy chatbot (not another wrapper around openai API)
    I made LLM companion by fine-tuning Llama-2-13B using custom written and collected sessions in Cognitive Behavioral Therapy (CBT). Data contains conversations that illustrate how to employ CBT techniques, including cognitive restructuring and mindfulness. It is mostly focused on asking insightful questions. Note: It is not production ready product. I am testing it and gathering feedback. Use it at your own risk. It is not meant to be replacement for professional help. You can access it here: https://poe.com/LunaChat Sorry that it is on Poe but this way it was much faster than making my own mobile friendly website. Since it's LLM, it is prone to hallucinate, generate responses that might be perceived as rude, and give insensitive comments or advice. Please use it with caution. submitted by /u/tsundoku-16 [link] [comments]  ( 9 min )
    [D] Query regarding use of language models for processing content of a completely new language or to stick to vanilla LSTMs
    Hey there, I had a doubt as to how we would move forward in a situation where you have a set of sequences and you need to classify them but then all these sequences are in a format you have never seen before or something like that. in this case I think none of the pre trained models would help considering none of them have been trained to even identify the sequence. Now my question is, are there any options for me or do I go back to vanilla transformers and LSTMs ​ ​ submitted by /u/IcecreamMan_1006 [link] [comments]  ( 9 min )
    [D] what's the foundations of data modeling?
    I came up with this thought experiment today because I'm trying to get at the heart of how to approximate a function. TDLR: if you know the foundational principles of that, it's really my whole question. I thought, ok, you are given a deterministic dataset and asked to model it perfectly. Perfectly means you extract every last ounce of information out of it, you can predict the dataset with 100% accuracy and you will be given new observations to predict that are more of the same so you should be able to predict those too. You are given a magic computer to make this model with. It's infinitely fast and has infinite memory. So you have no constraints, no limitations. You can do anything, but you must do it. You must write a way to build a perfect model. You can brute force it, but it has to learn the perfect model. What do you do? What does the simplest algorithm to perfectly model the data look like? submitted by /u/Stack3 [link] [comments]  ( 9 min )
    [D] Machine learning models for audit data
    What kind of machine learning models can be applied on structured audit data ? submitted by /u/Particular-Serve-981 [link] [comments]  ( 9 min )
    [R] CogVLM: Visual Expert for Pretrained Language Models
    Paper: https://arxiv.org/abs/2311.03079 GitHub: https://github.com/THUDM/CogVLM Abstract: We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at this https URL. https://preview.redd.it/yw0lr38gegzb1.png?width=1096&format=png&auto=webp&s=23361a84319c0fcbf1e980ca74dea26cb8be325b submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] How would I go about getting a model developed to detect plants that don’t belong in an orthomosaic?
    I have been trying to figure out a good way to get better at doing spot applications of pesticides using data I generate with a drone. Currently I’m manually selecting plants in small plots to do this but I’m wanting to scale the acres I’m covering and doing it manually isn’t an option. I don’t have any experience with machine learning and would like to work with someone or pick up a system already outfitted for this sort of thing. Any help would be appreciated. submitted by /u/Col_Clucks [link] [comments]  ( 9 min )
    [D] How to use multiple LoRA adapters with ONNX and TensorRT
    I have some questions like this. I want to create many models that serve different purposes and are all based on one original model. I trained the model using LoRA and created many adapters. I want to convert all adapters to run with ONNX and TensorRT. Currently, I can only convert when merging the adapter LoRA into the original model. Is there any way I don't need to merge the adapter into the original model? submitted by /u/unknow_from_vietnam [link] [comments]  ( 9 min )
  • Open

    Build trust and safety for generative AI applications with Amazon Comprehend and LangChain
    We are witnessing a rapid increase in the adoption of large language models (LLM) that power generative AI applications across industries. LLMs are capable of a variety of tasks, such as generating creative content, answering inquiries via chatbots, generating code, and more. Organizations looking to use LLMs to power their applications are increasingly wary about data privacy to ensure trust and safety is maintained within their generative AI applications. This includes handling customers’ personally identifiable information (PII) data properly. It also includes preventing abusive and unsafe content from being propagated to LLMs and checking that data generated by LLMs follows the same principles. In this post, we discuss new features powered by Amazon Comprehend that enable seamless integration to ensure data privacy, content safety, and prompt safety in new and existing generative AI applications.  ( 13 min )
    Use machine learning without writing a single line of code with Amazon SageMaker Canvas
    In the recent past, using machine learning (ML) to make predictions, especially for data in the form of text and images, required extensive ML knowledge for creating and tuning of deep learning models. Today, ML has become more accessible to any user who wants to use ML models to generate business value. With Amazon SageMaker […]  ( 7 min )
    Explore advanced techniques for hyperparameter optimization with Amazon SageMaker Automatic Model Tuning
    Creating high-performance machine learning (ML) solutions relies on exploring and optimizing training parameters, also known as hyperparameters. Hyperparameters are the knobs and levers that we use to adjust the training process, such as learning rate, batch size, regularization strength, and others, depending on the specific model and task at hand. Exploring hyperparameters involves systematically varying […]  ( 20 min )
  • Open

    AI could cause 'catastrophic' financial crisis – Yuval Noah Harari
    Artificial intelligence (AI) could potentially cause a catastrophic financial crisis, according to historian Yuval Noah Harari. Harari explains that the complexity of AI makes it difficult to predict and understand all the potential dangers and problems it could create. He highlights the need for global cooperation to address the risks of AI and emphasizes the importance of safety testing and regulation. Harari also raises concerns about AI's impact on the financial sector, suggesting that AI could create financial devices that are even more complex than those that caused the 2007-08 financial crisis. He calls for the establishment of powerful regulatory institutions that can react quickly to the dangers of AI as they arise. Harari suggests that the focus should be on hiring experts who understand AI's potential impact on the world of finance. He mentions the recent establishment of AI safety institutes in the UK and the US as steps towards testing advanced AI models. Harari believes that regulatory institutions should be able to identify and react to the dangers of AI without relying on lengthy and outdated regulations. He emphasizes the role of financial regulators in overseeing AI and finance, as they understand the risks in their sectors. Source : https://www.theguardian.com/technology/2023/nov/09/yuval-noah-harari-artificial-intelligence-ai-cause-financial-crisis submitted by /u/NuseAI [link] [comments]
    Robots to the Rescue - High-Tech Helpers
    submitted by /u/donutloop [link] [comments]
    AI — weekly megathread!
    News provided by aibrews.com OpenAI’s DevDay announcements [Details: [1] and [2], Keynote Video]: New GPT-4 Turbo model: 128K context window, improved instruction following, 3x cheaper price for input tokens and a 2x cheaper price for output tokens compared to GPT-4. GPTs: Custom versions of ChatGPT that users can create and share for a specific purpose using natural language. Users can also define custom actions by making one or more APIs available to the GPT allowing GPTs to integrate external data or interact with the real-world. GPT Store: a searchable store for GPTs rolling out later this month with monetization for creators in the coming months. GPT-4 Turbo can accept images as inputs in the Chat Completions API, enabling use cases such as generating captions, analyzing real …
    The Rise of Celebrity AI and Deepfakes: Kendall Jenner to AI YoutTubers
    submitted by /u/Miserable-Win-4980 [link] [comments]
    When will cats take over the world?
    submitted by /u/Sea_Permit5660 [link] [comments]
    Hollywood's actor strike ends, but artificial intelligence and wages a 'concern' for Australian industry
    submitted by /u/Jariiari7 [link] [comments]
    Jevons’ Paradox: The Power Hungry Technology We Didn't See Coming
    submitted by /u/davinci-code [link] [comments]
    One-Minute Daily AI News 11/9/2023
    Humane officially launches the AI Pin, its OpenAI-powered wearable.[1] Picsart just launched 20+ AI tools to accelerate digital content creation.[2] Hollywood Actors Strike Ends With a Deal That Will Impact AI and Streaming for Decades.[3] California using AI to combat future wildfires.[4] Sources: [1] https://www.theverge.com/2023/11/9/23953901/humane-ai-pin-launch-date-price-openai [2] https://venturebeat.com/ai/picsart-just-launched-20-ai-tools-to-accelerate-digital-content-creation/ [3] https://www.wired.com/story/hollywood-actors-strike-ends-ai-streaming/ [4] https://www.nbcnews.com/now/video/california-using-ai-to-combat-future-wildfires-197530181843 submitted by /u/Excellent-Target-847 [link] [comments]
    AI I can train with my own art?
    Context: I'm writing a paper that involves weighing the pros and cons of regulating what people are allowed to train their AI models with for creative purposes. It's a multi-modal research project with visuals, and I want to compare the quality of standard AI and a “personally trained” AI where I control what goes into it. Or at the very least the closest I can get to it for the purpose of the paper, as someone who certainly can't just make my own. I won't need it for very long, so ease of installation is ideal, but as long as it's just doable that's fine. One for images and one for text would actually be ideal, but I'm not familiar with the full capabilities of AI right now (hence the research paper, I'm very excited to learn more) so I'm not sure what's doable. Also happy to discuss the topic if anyone is interested, though I'm sure there's plenty to read about it on this subreddit. submitted by /u/Frigginconfused [link] [comments]
    I just made a GPT called DAD! It's a digital personification of the quintessential father figure. This virtual dad offers a wide range of advice from home improvement to financial management, while maintaining a friendly, humorous personality. Check it out!
    submitted by /u/dangerpotter [link] [comments]
  • Open

    Enabling large-scale health studies for the research community
    Posted by Chintan Ghate, Software Engineer, and Diana Mincu, Research Engineer, Google Research As consumer technologies like fitness trackers and mobile phones become more widely used for health-related data collection, so does the opportunity to leverage these data pathways to study and advance our understanding of medical conditions. We have previously touched upon how our work explores the use of this technology within the context of chronic diseases, in particular multiple sclerosis (MS). This effort leverages the FDA MyStudies platform, an open-source platform used to create clinical study apps, that makes it easier for anyone to run their own studies and collect good quality healthcare data, in a trusted and safe way. Today, we describe the setup that we developed by expan…  ( 93 min )
  • Open

    Mastering data management: Efficient strategies for success
    Effective data management is critical to any organization’s success in today’s data-driven world. With the exponential growth of data, businesses need to implement robust strategies to collect, store, process, and utilize data efficiently. In this blog, we will explore the best data management strategies to help your organization harness the power of data for informed… Read More »Mastering data management: Efficient strategies for success The post Mastering data management: Efficient strategies for success appeared first on Data Science Central.  ( 22 min )
  • Open

    Scroll Back in Time: AI Deciphers Ancient Roman Riddles
    Thanks to a viral trend sweeping social media, we now know some men think about the Roman Empire every day. And thanks to Luke Farritor, a 21-year-old computer science undergrad at the University of Nebraska-Lincoln, and like-minded AI enthusiasts, there might soon be a lot more to think about. Blending a passion for history with Read article >  ( 6 min )
    Scroll Back in Time: AI Deciphers Ancient Roman Riddles
    Thanks to a viral trend sweeping social media, we now know some men think about the Roman Empire every day. And thanks to Luke Farritor, a 21-year-old computer science undergrad at the University of Nebraska-Lincoln, and like-minded AI enthusiasts, there might soon be a lot more to think about. Blending a passion for history with Read article >  ( 6 min )
  • Open

    PRECISE Seminar Talk “Evaluating and Calibrating AI Models with Uncertain Ground Truth”
    I had the pleasure to present our work on evaluating and calibrating with uncertain ground truth at the seminar series of the PRECISE center at the University of Pennsylvania. Besides talking about our recent papers on evaluating AI models in health with uncertain ground truth and conformal prediction with uncertain ground truth, I also got to learn more about the research at PRECISE through post-doc and student presentations. In this article, I want to share the corresponding slides. The post PRECISE Seminar Talk “Evaluating and Calibrating AI Models with Uncertain Ground Truth” appeared first on David Stutz.  ( 4 min )
  • Open

    AnglE-optimized Text Embeddings. (arXiv:2309.12871v6 [cs.CL] UPDATED)
    High-quality text embedding is pivotal in improving semantic textual similarity (STS) tasks, which are crucial components in Large Language Model (LLM) applications. However, a common challenge existing text embedding models face is the problem of vanishing gradients, primarily due to their reliance on the cosine function in the optimization objective, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This novel approach effectively mitigates the adverse effects of the saturation zone in the cosine function, which can impede gradient and hinder optimization processes. To set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks including short-text STS, long-text STS, and domain-specific STS tasks. The results show that AnglE outperforms the state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings demonstrate the ability of AnglE to generate high-quality text embeddings and the usefulness of angle optimization in STS.  ( 2 min )
    The Geometric Structure of Fully-Connected ReLU Layers. (arXiv:2310.03482v2 [cs.LG] UPDATED)
    We formalize and interpret the geometric structure of $d$-dimensional fully connected ReLU layers in neural networks. The parameters of a ReLU layer induce a natural partition of the input domain, such that the ReLU layer can be significantly simplified in each sector of the partition. This leads to a geometric interpretation of a ReLU layer as a projection onto a polyhedral cone followed by an affine transformation, in line with the description in [doi:10.48550/arXiv.1905.08922] for convolutional networks with ReLU activations. Further, this structure facilitates simplified expressions for preimages of the intersection between partition sectors and hyperplanes, which is useful when describing decision boundaries in a classification setting. We investigate this in detail for a feed-forward network with one hidden ReLU-layer, where we provide results on the geometric complexity of the decision boundary generated by such networks, as well as proving that modulo an affine transformation, such a network can only generate $d$ different decision boundaries. Finally, the effect of adding more layers to the network is discussed.  ( 2 min )
    Federated Foundation Models: Privacy-Preserving and Collaborative Learning for Large Models. (arXiv:2305.11414v2 [cs.LG] UPDATED)
    Foundation Models (FMs), such as LLaMA, BERT, GPT, ViT, and CLIP, have demonstrated remarkable success in a wide range of applications, driven by their ability to leverage vast amounts of data for pre-training. However, optimizing FMs often requires access to sensitive data, raising privacy concerns and limiting their applicability in many domains. In this paper, we propose the Federated Foundation Models (FFMs) paradigm, which combines the benefits of FMs and Federated Learning (FL) to enable privacy-preserving and collaborative learning across multiple end-users. We discuss the potential benefits and challenges of integrating FL into the lifespan of FMs, covering pre-training, fine-tuning, and application. We further outline potential future research avenues in FFM, including FFM pre-training, FFM fine-tuning, and federated prompt tuning, which allow the development of more personalized and context-aware models while ensuring data privacy. Moreover, we explore the possibility of continual/lifelong learning in FFMs, as increased computational power at the edge may unlock the potential for optimizing FMs using newly generated private data close to the data source. The proposed FFM concepts offer a flexible and scalable framework for training large language models in a privacy-preserving manner, setting the stage for subsequent advancements in both FM training and federated learning.  ( 2 min )
    Robust Graph Clustering via Meta Weighting for Noisy Graphs. (arXiv:2311.00322v2 [cs.LG] UPDATED)
    How can we find meaningful clusters in a graph robustly against noise edges? Graph clustering (i.e., dividing nodes into groups of similar ones) is a fundamental problem in graph analysis with applications in various fields. Recent studies have demonstrated that graph neural network (GNN) based approaches yield promising results for graph clustering. However, we observe that their performance degenerates significantly on graphs with noise edges, which are prevalent in practice. In this work, we propose MetaGC for robust GNN-based graph clustering. MetaGC employs a decomposable clustering loss function, which can be rephrased as a sum of losses over node pairs. We add a learnable weight to each node pair, and MetaGC adaptively adjusts the weights of node pairs using meta-weighting so that the weights of meaningful node pairs increase and the weights of less-meaningful ones (e.g., noise edges) decrease. We show empirically that MetaGC learns weights as intended and consequently outperforms the state-of-the-art GNN-based competitors, even when they are equipped with separate denoising schemes, on five real-world graphs under varying levels of noise. Our code and datasets are available at https://github.com/HyeonsooJo/MetaGC.  ( 2 min )
    Chain-of-Thought Reasoning is a Policy Improvement Operator. (arXiv:2309.08589v2 [cs.LG] UPDATED)
    Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-generated training data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. During the self-learning loop, SECToR asks models to solve addition problems using chain-of-thought reasoning before training the next version of the model to solve those same problems directly without using such reasoning. This process often results in an improved model which can, when again augmented with chain-of-thought reasoning, solve even harder problems than the original model, allowing the self-learning loop to continue. Language models trained via SECToR autonomously learn to add up to the longest-length-digit numbers without access to any ground truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, similarly to how Monte-Carlo Tree Search is used in AlphaZero (Silver et al., 2017). We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.  ( 2 min )
    Stochastic Thermodynamics of Learning Generative Models. (arXiv:2310.19802v3 [cs.LG] UPDATED)
    We have formulated generative machine learning problems as the time evolution of Parametric Probabilistic Models (PPMs), inherently rendering a thermodynamic process. Then, we have studied the thermodynamic exchange between the model's parameters, denoted as $\Theta$, and the model's generated samples, denoted as $X$. We demonstrate that the training dataset and the action of the Stochastic Gradient Descent (SGD) optimizer serve as a work source that governs the time evolution of these two subsystems. Our findings reveal that the model learns through the dissipation of heat during the generation of samples $X$, leading to an increase in the entropy of the model's parameters, $\Theta$. Thus, the parameter subsystem acts as a heat reservoir, effectively storing the learned information. Furthermore, the role of the model's parameters as a heat reservoir provides valuable thermodynamic insights into the generalization power of over-parameterized models. This approach offers an unambiguous framework for computing information-theoretic quantities within deterministic neural networks by establishing connections with thermodynamic variables. To illustrate the utility of this framework, we introduce two information-theoretic metrics: Memorized-information (M-info) and Learned-information (L-info), which trace the dynamic flow of information during the learning process of PPMs.  ( 2 min )
    SEMQA: Semi-Extractive Multi-Source Question Answering. (arXiv:2311.04886v1 [cs.CL])
    Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this work, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer, while mixing factual quoted spans -- copied verbatim from given input sources -- and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of QuoteSum for developing and studying such consolidation capabilities.  ( 2 min )
    Quality-Diversity through AI Feedback. (arXiv:2310.13032v3 [cs.CL] UPDATED)
    In many text-generation problems, users may prefer not only a single response, but a diverse range of high-quality outputs from which to choose. Quality-diversity (QD) search algorithms aim at such outcomes, by continually improving and diversifying a population of candidates. However, the applicability of QD to qualitative domains, like creative writing, has been limited by the difficulty of algorithmically specifying measures of quality and diversity. Interestingly, recent developments in language models (LMs) have enabled guiding search through AI feedback, wherein LMs are prompted in natural language to evaluate qualitative aspects of text. Leveraging this development, we introduce Quality-Diversity through AI Feedback (QDAIF), wherein an evolutionary algorithm applies LMs to both generate variation and evaluate the quality and diversity of candidate text. When assessed on creative writing domains, QDAIF covers more of a specified search space with high-quality samples than do non-QD controls. Further, human evaluation of QDAIF-generated creative texts validates reasonable agreement between AI and human evaluation. Our results thus highlight the potential of AI feedback to guide open-ended search for creative and original solutions, providing a recipe that seemingly generalizes to many domains and modalities. In this way, QDAIF is a step towards AI systems that can independently search, diversify, evaluate, and improve, which are among the core skills underlying human society's capacity for innovation.  ( 3 min )
    Solution of FPK Equation for Stochastic Dynamics Subjected to Additive Gaussian Noise via Deep Learning Approach. (arXiv:2311.04511v1 [cs.LG])
    The Fokker-Plank-Kolmogorov (FPK) equation is an idealized model representing many stochastic systems commonly encountered in the analysis of stochastic structures as well as many other applications. Its solution thus provides an invaluable insight into the performance of many engineering systems. Despite its great importance, the solution of the FPK equation is still extremely challenging. For systems of practical significance, the FPK equation is usually high dimensional, rendering most of the numerical methods ineffective. In this respect, the present work introduces the FPK-DP Net as a physics-informed network that encodes the physical insights, i.e. the governing constrained differential equations emanated out of physical laws, into a deep neural network. FPK-DP Net is a mesh-free learning method that can solve the density evolution of stochastic dynamics subjected to additive white Gaussian noise without any prior simulation data and can be used as an efficient surrogate model afterward. FPK-DP Net uses the dimension-reduced FPK equation. Therefore, it can be used to address high-dimensional practical problems as well. To demonstrate the potential applicability of the proposed framework, and to study its accuracy and efficacy, numerical implementations on five different benchmark problems are investigated.  ( 3 min )
    Deep learning as a tool for quantum error reduction in quantum image processing. (arXiv:2311.04575v1 [quant-ph])
    Despite the limited availability and quantum volume of quantum computers, quantum image representation is a widely researched area. Currently developed methods use quantum entanglement to encode information about pixel positions. These methods range from using the angle parameter of the rotation gate (e.g., the Flexible Representation of Quantum Images, FRQI), sequences of qubits (e.g., Novel Enhanced Quantum Representation, NEQR), or the angle parameter of the phase shift gates (e.g., Local Phase Image Quantum Encoding, LPIQE) for storing color information. All these methods are significantly affected by decoherence and other forms of quantum noise, which is an inseparable part of quantum computing in the noisy intermediate-scale quantum era. These phenomena can highly influence the measurements and result in extracted images that are visually dissimilar to the originals. Because this process is at its foundation quantum, the computational reversal of this process is possible. There are many methods for error correction, mitigation, and reduction, but all of them use quantum computer time or additional qubits to achieve the desired result. We report the successful use of a generative adversarial network trained for image-to-image translation, in conjunction with Phase Distortion Unraveling error reduction method, for reducing overall error in images encoded using LPIQE.  ( 2 min )
    Uncertainty Quantification for Eosinophil Segmentation. (arXiv:2309.16536v2 [eess.IV] UPDATED)
    Eosinophilic Esophagitis (EoE) is an allergic condition increasing in prevalence. To diagnose EoE, pathologists must find 15 or more eosinophils within a single high-power field (400X magnification). Determining whether or not a patient has EoE can be an arduous process and any medical imaging approaches used to assist diagnosis must consider both efficiency and precision. We propose an improvement of Adorno et al's approach for quantifying eosinphils using deep image segmentation. Our new approach leverages Monte Carlo Dropout, a common approach in deep learning to reduce overfitting, to provide uncertainty quantification on current deep learning models. The uncertainty can be visualized in an output image to evaluate model performance, provide insight to how deep learning algorithms function, and assist pathologists in identifying eosinophils.  ( 2 min )
    Be Careful When Evaluating Explanations Regarding Ground Truth. (arXiv:2311.04813v1 [cs.CV])
    Evaluating explanations of image classifiers regarding ground truth, e.g. segmentation masks defined by human perception, primarily evaluates the quality of the models under consideration rather than the explanation methods themselves. Driven by this observation, we propose a framework for $\textit{jointly}$ evaluating the robustness of safety-critical systems that $\textit{combine}$ a deep neural network with an explanation method. These are increasingly used in real-world applications like medical image analysis or robotics. We introduce a fine-tuning procedure to (mis)align model$\unicode{x2013}$explanation pipelines with ground truth and use it to quantify the potential discrepancy between worst and best-case scenarios of human alignment. Experiments across various model architectures and post-hoc local interpretation methods provide insights into the robustness of vision transformers and the overall vulnerability of such AI systems to potential adversarial attacks.  ( 2 min )
    Learning to Select SAT Encodings for Pseudo-Boolean and Linear Integer Constraints. (arXiv:2307.09342v2 [cs.AI] UPDATED)
    Many constraint satisfaction and optimisation problems can be solved effectively by encoding them as instances of the Boolean Satisfiability problem (SAT). However, even the simplest types of constraints have many encodings in the literature with widely varying performance, and the problem of selecting suitable encodings for a given problem instance is not trivial. We explore the problem of selecting encodings for pseudo-Boolean and linear constraints using a supervised machine learning approach. We show that it is possible to select encodings effectively using a standard set of features for constraint problems; however we obtain better performance with a new set of features specifically designed for the pseudo-Boolean and linear constraints. In fact, we achieve good results when selecting encodings for unseen problem classes. Our results compare favourably to AutoFolio when using the same feature set. We discuss the relative importance of instance features to the task of selecting the best encodings, and compare several variations of the machine learning method.  ( 2 min )
    FetMRQC: an open-source machine learning framework for multi-centric fetal brain MRI quality control. (arXiv:2311.04780v1 [eess.IV])
    Fetal brain MRI is becoming an increasingly relevant complement to neurosonography for perinatal diagnosis, allowing fundamental insights into fetal brain development throughout gestation. However, uncontrolled fetal motion and heterogeneity in acquisition protocols lead to data of variable quality, potentially biasing the outcome of subsequent studies. We present FetMRQC, an open-source machine-learning framework for automated image quality assessment and quality control that is robust to domain shifts induced by the heterogeneity of clinical data. FetMRQC extracts an ensemble of quality metrics from unprocessed anatomical MRI and combines them to predict experts' ratings using random forests. We validate our framework on a pioneeringly large and diverse dataset of more than 1600 manually rated fetal brain T2-weighted images from four clinical centers and 13 different scanners. Our study shows that FetMRQC's predictions generalize well to unseen data while being interpretable. FetMRQC is a step towards more robust fetal brain neuroimaging, which has the potential to shed new insights on the developing human brain.  ( 3 min )
    Hierarchically Gated Recurrent Neural Network for Sequence Modeling. (arXiv:2311.04823v1 [cs.CL])
    Transformers have surpassed RNNs in popularity due to their superior abilities in parallel training and long-term dependency modeling. Recently, there has been a renewed interest in using linear RNNs for efficient sequence modeling. These linear RNNs often employ gating mechanisms in the output of the linear recurrence layer while ignoring the significance of using forget gates within the recurrence. In this paper, we propose a gated linear RNN model dubbed Hierarchically Gated Recurrent Neural Network (HGRN), which includes forget gates that are lower bounded by a learnable value. The lower bound increases monotonically when moving up layers. This allows the upper layers to model long-term dependencies and the lower layers to model more local, short-term dependencies. Experiments on language modeling, image classification, and long-range arena benchmarks showcase the efficiency and effectiveness of our proposed model. The source code is available at https://github.com/OpenNLPLab/HGRN.  ( 2 min )
    Explainable AI for Earth Observation: Current Methods, Open Challenges, and Opportunities. (arXiv:2311.04491v1 [cs.AI])
    Deep learning has taken by storm all fields involved in data analysis, including remote sensing for Earth observation. However, despite significant advances in terms of performance, its lack of explainability and interpretability, inherent to neural networks in general since their inception, remains a major source of criticism. Hence it comes as no surprise that the expansion of deep learning methods in remote sensing is being accompanied by increasingly intensive efforts oriented towards addressing this drawback through the exploration of a wide spectrum of Explainable Artificial Intelligence techniques. This chapter, organized according to prominent Earth observation application fields, presents a panorama of the state-of-the-art in explainable remote sensing image analysis.  ( 2 min )
    Distributed Agent-Based Collaborative Learning in Cross-Individual Wearable Sensor-Based Human Activity Recognition. (arXiv:2311.04236v1 [eess.SP])
    The rapid growth of wearable sensor technologies holds substantial promise for the field of personalized and context-aware Human Activity Recognition. Given the inherently decentralized nature of data sources within this domain, the utilization of multi-agent systems with their inherent decentralization capabilities presents an opportunity to facilitate the development of scalable, adaptable, and privacy-conscious methodologies. This paper introduces a collaborative distributed learning approach rooted in multi-agent principles, wherein individual users of sensor-equipped devices function as agents within a distributed network, collectively contributing to the comprehensive process of learning and classifying human activities. In this proposed methodology, not only is the privacy of activity monitoring data upheld for each individual, eliminating the need for an external server to oversee the learning process, but the system also exhibits the potential to surmount the limitations of conventional centralized models and adapt to the unique attributes of each user. The proposed approach has been empirically tested on two publicly accessible human activity recognition datasets, specifically PAMAP2 and HARTH, across varying settings. The provided empirical results conclusively highlight the efficacy of inter-individual collaborative learning when contrasted with centralized configurations, both in terms of local and global generalization.
    Online Learning Quantum States with the Logarithmic Loss via VB-FTRL. (arXiv:2311.04237v1 [quant-ph])
    Online learning quantum states with the logarithmic loss (LL-OLQS) is a quantum generalization of online portfolio selection, a classic open problem in the field of online learning for over three decades. The problem also emerges in designing randomized optimization algorithms for maximum-likelihood quantum state tomography. Recently, Jezequel et al. (arXiv:2209.13932) proposed the VB-FTRL algorithm, the first nearly regret-optimal algorithm for OPS with moderate computational complexity. In this note, we generalize VB-FTRL for LL-OLQS. Let $d$ denote the dimension and $T$ the number of rounds. The generalized algorithm achieves a regret rate of $O ( d^2 \log ( d + T ) )$ for LL-OLQS. Each iteration of the algorithm consists of solving a semidefinite program that can be implemented in polynomial time by, e.g., cutting-plane methods. For comparison, the best-known regret rate for LL-OLQS is currently $O ( d^2 \log T )$, achieved by the exponential weight method. However, there is no explicit implementation available for the exponential weight method for LL-OLQS. To facilitate the generalization, we introduce the notion of VB-convexity. VB-convexity is a sufficient condition for the logarithmic barrier associated with any function to be convex and is of independent interest.
    Solving Kernel Ridge Regression with Gradient-Based Optimization Methods. (arXiv:2306.16838v3 [stat.ML] UPDATED)
    Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. Here, we introduce an equivalent formulation of the objective function of KRR, opening up both for using penalties other than the ridge penalty and for studying kernel ridge regression from the perspective of gradient descent. Using a continuous-time perspective, we derive a closed-form solution for solving kernel regression with gradient descent, something we refer to as kernel gradient flow, KGF, and theoretically bound the differences between KRR and KGF, where, for the latter, regularization is obtained through early stopping. We also generalize KRR by replacing the ridge penalty with the $\ell_1$ and $\ell_\infty$ penalties, respectively, and use the fact that analogous to the similarities between KGF and KRR, $\ell_1$ regularization and forward stagewise regression (also known as coordinate descent), and $\ell_\infty$ regularization and sign gradient descent, follow similar solution paths. We can thus alleviate the need for computationally heavy algorithms based on proximal gradient descent. We show theoretically and empirically how the $\ell_1$ and $\ell_\infty$ penalties, and the corresponding gradient-based optimization algorithms, produce sparse and robust kernel regression solutions, respectively.
    More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime-validity. (arXiv:2306.12214v2 [stat.ML] UPDATED)
    In this paper, we present new high-probability PAC-Bayes bounds for different types of losses. Firstly, for losses with a bounded range, we recover a strengthened version of Catoni's bound that holds uniformly for all parameter values. This leads to new fast rate and mixed rate bounds that are interpretable and tighter than previous bounds in the literature. In particular, the fast rate bound is equivalent to the Seeger--Langford bound. Secondly, for losses with more general tail behaviors, we introduce two new parameter-free bounds: a PAC-Bayes Chernoff analogue when the loss' cumulative generating function is bounded, and a bound when the loss' second moment is bounded. These two bounds are obtained using a new technique based on a discretization of the space of possible events for the "in probability" parameter optimization problem. This technique is both simpler and more general than previous approaches optimizing over a grid on the parameters' space. Finally, we extend all previous results to anytime-valid bounds using a simple technique applicable to any existing bound.
    Spuriosity Didn't Kill the Classifier: Using Invariant Predictions to Harness Spurious Features. (arXiv:2307.09933v2 [cs.LG] UPDATED)
    To avoid failures on out-of-distribution data, recent works have sought to extract features that have an invariant or stable relationship with the label across domains, discarding "spurious" or unstable features whose relationship with the label changes across domains. However, unstable features often carry complementary information that could boost performance if used correctly in the test domain. In this work, we show how this can be done without test-domain labels. In particular, we prove that pseudo-labels based on stable features provide sufficient guidance for doing so, provided that stable and unstable features are conditionally independent given the label. Based on this theoretical insight, we propose Stable Feature Boosting (SFB), an algorithm for: (i) learning a predictor that separates stable and conditionally-independent unstable features; and (ii) using the stable-feature predictions to adapt the unstable-feature predictions in the test domain. Theoretically, we prove that SFB can learn an asymptotically-optimal predictor without test-domain labels. Empirically, we demonstrate the effectiveness of SFB on real and synthetic data.
    Efficient Beam Tree Recursion. (arXiv:2307.10779v2 [cs.LG] UPDATED)
    Beam Tree Recursive Neural Network (BT-RvNN) was recently proposed as a simple extension of Gumbel Tree RvNN and it was shown to achieve state-of-the-art length generalization performance in ListOps while maintaining comparable performance on other tasks. However, although not the worst in its kind, BT-RvNN can be still exorbitantly expensive in memory usage. In this paper, we identify the main bottleneck in BT-RvNN's memory usage to be the entanglement of the scorer function and the recursive cell function. We propose strategies to remove this bottleneck and further simplify its memory usage. Overall, our strategies not only reduce the memory usage of BT-RvNN by $10$-$16$ times but also create a new state-of-the-art in ListOps while maintaining similar performance in other tasks. In addition, we also propose a strategy to utilize the induced latent-tree node representations produced by BT-RvNN to turn BT-RvNN from a sentence encoder of the form $f:\mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{d}$ into a sequence contextualizer of the form $f:\mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times d}$. Thus, our proposals not only open up a path for further scalability of RvNNs but also standardize a way to use BT-RvNNs as another building block in the deep learning toolkit that can be easily stacked or interfaced with other popular models such as Transformers and Structured State Space models.
    Decentralized Personalized Online Federated Learning. (arXiv:2311.04817v1 [cs.LG])
    Vanilla federated learning does not support learning in an online environment, learning a personalized model on each client, and learning in a decentralized setting. There are existing methods extending federated learning in each of the three aspects. However, some important applications on enterprise edge servers (e.g. online item recommendation at global scale) involve the three aspects at the same time. Therefore, we propose a new learning setting \textit{Decentralized Personalized Online Federated Learning} that considers all the three aspects at the same time. In this new setting for learning, the first technical challenge is how to aggregate the shared model parameters from neighboring clients to obtain a personalized local model with good performance on each client. We propose to directly learn an aggregation by optimizing the performance of the local model with respect to the aggregation weights. This not only improves personalization of each local model but also helps the local model adapting to potential data shift by intelligently incorporating the right amount of information from its neighbors. The second challenge is how to select the neighbors for each client. We propose a peer selection method based on the learned aggregation weights enabling each client to select the most helpful neighbors and reduce communication cost at the same time. We verify the effectiveness and robustness of our proposed method on three real-world item recommendation datasets and one air quality prediction dataset.
    Algorithms for Non-Negative Matrix Factorization on Noisy Data With Negative Values. (arXiv:2311.04855v1 [astro-ph.IM])
    Non-negative matrix factorization (NMF) is a dimensionality reduction technique that has shown promise for analyzing noisy data, especially astronomical data. For these datasets, the observed data may contain negative values due to noise even when the true underlying physical signal is strictly positive. Prior NMF work has not treated negative data in a statistically consistent manner, which becomes problematic for low signal-to-noise data with many negative values. In this paper we present two algorithms, Shift-NMF and Nearly-NMF, that can handle both the noisiness of the input data and also any introduced negativity. Both of these algorithms use the negative data space without clipping, and correctly recover non-negative signals without any introduced positive offset that occurs when clipping negative data. We demonstrate this numerically on both simple and more realistic examples, and prove that both algorithms have monotonically decreasing update rules.
    Computing with Residue Numbers in High-Dimensional Representation. (arXiv:2311.04872v1 [cs.NE])
    We introduce Residue Hyperdimensional Computing, a computing framework that unifies residue number systems with an algebra defined over random, high-dimensional vectors. We show how residue numbers can be represented as high-dimensional vectors in a manner that allows algebraic operations to be performed with component-wise, parallelizable operations on the vector elements. The resulting framework, when combined with an efficient method for factorizing high-dimensional vectors, can represent and operate on numerical values over a large dynamic range using vastly fewer resources than previous methods, and it exhibits impressive robustness to noise. We demonstrate the potential for this framework to solve computationally difficult problems in visual perception and combinatorial optimization, showing improvement over baseline methods. More broadly, the framework provides a possible account for the computational operations of grid cells in the brain, and it suggests new machine learning architectures for representing and manipulating numerical data.
    LoopTune: Optimizing Tensor Computations with Reinforcement Learning. (arXiv:2309.01825v3 [cs.LG] UPDATED)
    Advanced compiler technology is crucial for enabling machine learning applications to run on novel hardware, but traditional compilers fail to deliver performance, popular auto-tuners have long search times and expert-optimized libraries introduce unsustainable costs. To address this, we developed LoopTune, a deep reinforcement learning compiler that optimizes tensor computations in deep learning models for the CPU. LoopTune optimizes tensor traversal order while using the ultra-fast lightweight code generator LoopNest to perform hardware-specific optimizations. With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating an order of magnitude faster code than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM, consistently performing at the level of the hand-tuned library Numpy. Moreover, LoopTune tunes code in order of seconds.
    Dissecting Chain-of-Thought: Compositionality through In-Context Filtering and Learning. (arXiv:2305.18869v2 [cs.LG] UPDATED)
    Chain-of-thought (CoT) is a method that enables language models to handle complex reasoning tasks by decomposing them into simpler steps. Despite its success, the underlying mechanics of CoT are not yet fully understood. In an attempt to shed light on this, our study investigates the impact of CoT on the ability of transformers to in-context learn a simple to study, yet general family of compositional functions: multi-layer perceptrons (MLPs). In this setting, we find that the success of CoT can be attributed to breaking down in-context learning of a compositional function into two distinct phases: focusing on and filtering data related to each step of the composition and in-context learning the single-step composition function. Through both experimental and theoretical evidence, we demonstrate how CoT significantly reduces the sample complexity of in-context learning (ICL) and facilitates the learning of complex functions that non-CoT methods struggle with. Furthermore, we illustrate how transformers can transition from vanilla in-context learning to mastering a compositional function with CoT by simply incorporating additional layers that perform the necessary data-filtering for CoT via the attention mechanism. In addition to these test-time benefits, we show CoT helps accelerate pretraining by learning shortcuts to represent complex functions and filtering plays an important role in this process. These findings collectively provide insights into the mechanics of CoT, inviting further investigation of its role in complex reasoning tasks.
    N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics. (arXiv:2310.18679v2 [cs.CL] UPDATED)
    We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination. This method involves refining model outputs through an ensemble of critics and the model's own feedback. Drawing inspiration from human behavior, we explore whether LLMs can emulate the self-correction process observed in humans who often engage in self-reflection and seek input from others to refine their understanding of complex topics. Our approach is model-agnostic and can be applied across various domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. We consistently observe performance improvements in LLMs for reducing toxicity and correcting factual errors.
    CoCA: Fusing position embedding with Collinear Constrained Attention for fine-tuning free context window extending. (arXiv:2309.08646v2 [cs.LG] UPDATED)
    Self-attention and position embedding are two key modules in Transformer based LLMs. The potential relationship among them are far from well studied, especially for context window extending. In this paper, we introduce collinear constrained relationship to fuse RoPE and self-attention, and name it as Collinear Constrained Attention (CoCA). We've analyzed the computational and spatial complexity of CoCA and have determined that it adds only minimal additional overhead compared to the original Transformer-based models. We provide an efficient implementation of CoCA, and make it drop-in replacement for any existing position embedding and attention modules in Transformer based models. Experiments show that CoCA performs extraordinary well on context window extending. For instance, a CoCA based GPT model trained with 512 context length can extend the context window up to 8K without perplexity diverging. This indicates more than 16x context window extending without any fine-tuning. Our code is released here: https://github.com/codefuse-ai/Collinear-Constrained-Attention
    Vital Sign Forecasting for Sepsis Patients in ICUs. (arXiv:2311.04770v1 [cs.LG])
    Sepsis and septic shock are a critical medical condition affecting millions globally, with a substantial mortality rate. This paper uses state-of-the-art deep learning (DL) architectures to introduce a multi-step forecasting system to predict vital signs indicative of septic shock progression in Intensive Care Units (ICUs). Our approach utilizes a short window of historical vital sign data to forecast future physiological conditions. We introduce a DL-based vital sign forecasting system that predicts up to 3 hours of future vital signs from 6 hours of past data. We further adopt the DILATE loss function to capture better the shape and temporal dynamics of vital signs, which are critical for clinical decision-making. We compare three DL models, N-BEATS, N-HiTS, and Temporal Fusion Transformer (TFT), using the publicly available eICU Collaborative Research Database (eICU-CRD), highlighting their forecasting capabilities in a critical care setting. We evaluate the performance of our models using mean squared error (MSE) and dynamic time warping (DTW) metrics. Our findings show that while TFT excels in capturing overall trends, N-HiTS is superior in retaining short-term fluctuations within a predefined range. This paper demonstrates the potential of deep learning in transforming the monitoring systems in ICUs, potentially leading to significant improvements in patient care and outcomes by accurately forecasting vital signs to assist healthcare providers in detecting early signs of physiological instability and anticipating septic shock.
    Robust Mean Estimation Without Moments for Symmetric Distributions. (arXiv:2302.10844v2 [cs.DS] UPDATED)
    We study the problem of robustly estimating the mean or location parameter without moment assumptions. We show that for a large class of symmetric distributions, the same error as in the Gaussian setting can be achieved efficiently. The distributions we study include products of arbitrary symmetric one-dimensional distributions, such as product Cauchy distributions, as well as elliptical distributions. For product distributions and elliptical distributions with known scatter (covariance) matrix, we show that given an $\varepsilon$-corrupted sample, we can with probability at least $1-\delta$ estimate its location up to error $O(\varepsilon \sqrt{\log(1/\varepsilon)})$ using $\tfrac{d\log(d) + \log(1/\delta)}{\varepsilon^2 \log(1/\varepsilon)}$ samples. This result matches the best-known guarantees for the Gaussian distribution and known SQ lower bounds (up to the $\log(d)$ factor). For elliptical distributions with unknown scatter (covariance) matrix, we propose a sequence of efficient algorithms that approaches this optimal error. Specifically, for every $k \in \mathbb{N}$, we design an estimator using time and samples $\tilde{O}({d^k})$ achieving error $O(\varepsilon^{1-\frac{1}{2k}})$. This matches the error and running time guarantees when assuming certifiably bounded moments of order up to $k$. For unknown covariance, such error bounds of $o(\sqrt{\varepsilon})$ are not even known for (general) sub-Gaussian distributions. Our algorithms are based on a generalization of the well-known filtering technique. We show how this machinery can be combined with Huber-loss-based techniques to work with projections of the noise that behave more nicely than the initial noise. Moreover, we show how SoS proofs can be used to obtain algorithmic guarantees even for distributions without a first moment. We believe that this approach may find other applications in future works.
    Concept-Centric Transformers: Enhancing Model Interpretability through Object-Centric Concept Learning within a Shared Global Workspace. (arXiv:2305.15775v3 [cs.LG] UPDATED)
    Many interpretable AI approaches have been proposed to provide plausible explanations for a model's decision-making. However, configuring an explainable model that effectively communicates among computational modules has received less attention. A recently proposed shared global workspace theory showed that networks of distributed modules can benefit from sharing information with a bottlenecked memory because the communication constraints encourage specialization, compositionality, and synchronization among the modules. Inspired by this, we propose Concept-Centric Transformers, a simple yet effective configuration of the shared global workspace for interpretability, consisting of: i) an object-centric-based memory module for extracting semantic concepts from input features, ii) a cross-attention mechanism between the learned concept and input embeddings, and iii) standard classification and explanation losses to allow human analysts to directly assess an explanation for the model's classification reasoning. We test our approach against other existing concept-based methods on classification tasks for various datasets, including CIFAR100, CUB-200-2011, and ImageNet, and we show that our model achieves better classification accuracy than all baselines across all problems but also generates more consistent concept-based explanations of classification output.
    Zero-Shot Anomaly Detection via Batch Normalization. (arXiv:2302.07849v4 [cs.LG] UPDATED)
    Anomaly detection (AD) plays a crucial role in many safety-critical application domains. The challenge of adapting an anomaly detector to drift in the normal data distribution, especially when no training data is available for the "new normal," has led to the development of zero-shot AD techniques. In this paper, we propose a simple yet effective method called Adaptive Centered Representations (ACR) for zero-shot batch-level AD. Our approach trains off-the-shelf deep anomaly detectors (such as deep SVDD) to adapt to a set of inter-related training data distributions in combination with batch normalization, enabling automatic zero-shot generalization for unseen AD tasks. This simple recipe, batch normalization plus meta-training, is a highly effective and versatile tool. Our theoretical results guarantee the zero-shot generalization for unseen AD tasks; our empirical results demonstrate the first zero-shot AD results for tabular data and outperform existing methods in zero-shot anomaly detection and segmentation on image data from specialized domains. Code is at https://github.com/aodongli/zero-shot-ad-via-batch-norm
    Survival Instinct in Offline Reinforcement Learning. (arXiv:2306.03286v2 [cs.LG] UPDATED)
    We present a novel observation about the behavior of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and certain implicit biases in common data collection practices. As we prove in this work, pessimism endows the agent with a "survival instinct", i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. Formally, given a reward class -- which may not even contain the true reward -- we identify conditions on the training data distribution that enable offline RL to learn a near-optimal and safe policy from any reward within the class. We argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating future ones. Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is nudged to learn a desirable behavior with imperfect reward but purposely biased data coverage.
    Federated Dataset Dictionary Learning for Multi-Source Domain Adaptation. (arXiv:2309.07670v2 [cs.LG] UPDATED)
    In this article, we propose an approach for federated domain adaptation, a setting where distributional shift exists among clients and some have unlabeled data. The proposed framework, FedDaDiL, tackles the resulting challenge through dictionary learning of empirical distributions. In our setting, clients' distributions represent particular domains, and FedDaDiL collectively trains a federated dictionary of empirical distributions. In particular, we build upon the Dataset Dictionary Learning framework by designing collaborative communication protocols and aggregation operations. The chosen protocols keep clients' data private, thus enhancing overall privacy compared to its centralized counterpart. We empirically demonstrate that our approach successfully generates labeled data on the target domain with extensive experiments on (i) Caltech-Office, (ii) TEP, and (iii) CWRU benchmarks. Furthermore, we compare our method to its centralized counterpart and other benchmarks in federated domain adaptation.
    Cross-Silo Federated Learning Across Divergent Domains with Iterative Parameter Alignment. (arXiv:2311.04818v1 [cs.LG])
    Learning from the collective knowledge of data dispersed across private sources can provide neural networks with enhanced generalization capabilities. Federated learning, a method for collaboratively training a machine learning model across remote clients, achieves this by combining client models via the orchestration of a central server. However, current approaches face two critical limitations: i) they struggle to converge when client domains are sufficiently different, and ii) current aggregation techniques produce an identical global model for each client. In this work, we address these issues by reformulating the typical federated learning setup: rather than learning a single global model, we learn N models each optimized for a common objective. To achieve this, we apply a weighted distance minimization to model parameters shared in a peer-to-peer topology. The resulting framework, Iterative Parameter Alignment, applies naturally to the cross-silo setting, and has the following properties: (i) a unique solution for each participant, with the option to globally converge each model in the federation, and (ii) an optional early-stopping mechanism to elicit fairness among peers in collaborative learning settings. These characteristics jointly provide a flexible new framework for iteratively learning from peer models trained on disparate datasets. We find that the technique achieves competitive results on a variety of data partitions compared to state-of-the-art approaches. Further, we show that the method is robust to divergent domains (i.e. disjoint classes across peers) where existing approaches struggle.
    Foundation Models for Generalist Geospatial Artificial Intelligence. (arXiv:2310.18660v2 [cs.CV] UPDATED)
    Significant progress in the development of highly adaptable and reusable Artificial Intelligence (AI) models is expected to have a significant impact on Earth science and remote sensing. Foundation models are pre-trained on large unlabeled datasets through self-supervision, and then fine-tuned for various downstream tasks with small labeled datasets. This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive geospatial data. We have utilized this framework to create Prithvi, a transformer-based geospatial foundational model pre-trained on more than 1TB of multispectral satellite imagery from the Harmonized Landsat-Sentinel 2 (HLS) dataset. Our study demonstrates the efficacy of our framework in successfully fine-tuning Prithvi to a range of Earth observation tasks that have not been tackled by previous work on foundation models involving multi-temporal cloud gap imputation, flood mapping, wildfire scar segmentation, and multi-temporal crop segmentation. Our experiments show that the pre-trained model accelerates the fine-tuning process compared to leveraging randomly initialized weights. In addition, pre-trained Prithvi compares well against the state-of-the-art, e.g., outperforming a conditional GAN model in multi-temporal cloud imputation by up to 5pp (or 5.7%) in the structural similarity index. Finally, due to the limited availability of labeled data in the field of Earth observation, we gradually reduce the quantity of available labeled data for refining the model to evaluate data efficiency and demonstrate that data can be decreased significantly without affecting the model's accuracy. The pre-trained 100 million parameter model and corresponding fine-tuning workflows have been released publicly as open source contributions to the global Earth sciences community through Hugging Face.
    ToddlerBERTa: Exploiting BabyBERTa for Grammar Learning and Language Understanding. (arXiv:2308.16336v2 [cs.CL] UPDATED)
    We present ToddlerBERTa, a BabyBERTa-like language model, exploring its capabilities through five different models with varied hyperparameters. Evaluating on BLiMP, SuperGLUE, MSGS, and a Supplement benchmark from the BabyLM challenge, we find that smaller models can excel in specific tasks, while larger models perform well with substantial data. Despite training on a smaller dataset, ToddlerBERTa demonstrates commendable performance, rivalling the state-of-the-art RoBERTa-base. The model showcases robust language understanding, even with single-sentence pretraining, and competes with baselines that leverage broader contextual information. Our work provides insights into hyperparameter choices, and data utilization, contributing to the advancement of language models.
    Causal disentanglement of multimodal data. (arXiv:2310.18471v2 [cs.LG] UPDATED)
    Causal representation learning algorithms discover lower-dimensional representations of data that admit a decipherable interpretation of cause and effect; as achieving such interpretable representations is challenging, many causal learning algorithms utilize elements indicating prior information, such as (linear) structural causal models, interventional data, or weak supervision. Unfortunately, in exploratory causal representation learning, such elements and prior information may not be available or warranted. Alternatively, scientific datasets often have multiple modalities or physics-based constraints, and the use of such scientific, multimodal data has been shown to improve disentanglement in fully unsupervised settings. Consequently, we introduce a causal representation learning algorithm (causalPIMA) that can use multimodal data and known physics to discover important features with causal relationships. Our innovative algorithm utilizes a new differentiable parametrization to learn a directed acyclic graph (DAG) together with a latent space of a variational autoencoder in an end-to-end differentiable framework via a single, tractable evidence lower bound loss function. We place a Gaussian mixture prior on the latent space and identify each of the mixtures with an outcome of the DAG nodes; this novel identification enables feature discovery with causal relationships. Tested against a synthetic and a scientific dataset, our results demonstrate the capability of learning an interpretable causal structure while simultaneously discovering key features in a fully unsupervised setting.
    Functional Bayesian Tucker Decomposition for Continuous-indexed Tensor Data. (arXiv:2311.04829v1 [cs.LG])
    Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there were finite objects in each aspect or mode, corresponding to discrete indexes of data entries. However, many real-world data are not naturally posed in the setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose Functional Bayesian Tucker Decomposition (FunBaT). We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions, and then convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE) to reduce computational cost. An efficient inference algorithm is further developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications.
    Real-Time Recurrent Reinforcement Learning. (arXiv:2311.04830v1 [cs.LG])
    Recent advances in reinforcement learning, for partially-observable Markov decision processes (POMDPs), rely on the biologically implausible backpropagation through time algorithm (BPTT) to perform gradient-descent optimisation. In this paper we propose a novel reinforcement learning algorithm that makes use of random feedback local online learning (RFLO), a biologically plausible approximation of realtime recurrent learning (RTRL) to compute the gradients of the parameters of a recurrent neural network in an online manner. By combining it with TD($\lambda$), a variant of temporaldifference reinforcement learning with eligibility traces, we create a biologically plausible, recurrent actor-critic algorithm, capable of solving discrete and continuous control tasks in POMDPs. We compare BPTT, RTRL and RFLO as well as different network architectures, and find that RFLO can perform just as well as RTRL while exceeding even BPTT in terms of complexity. The proposed method, called real-time recurrent reinforcement learning (RTRRL), serves as a model of learning in biological neural networks mimicking reward pathways in the mammalian brain.
    FetMRQC: Automated Quality Control for fetal brain MRI. (arXiv:2304.05879v2 [eess.IV] UPDATED)
    Quality control (QC) has long been considered essential to guarantee the reliability of neuroimaging studies. It is particularly important for fetal brain MRI, where large and unpredictable fetal motion can lead to substantial artifacts in the acquired images. Existing methods for fetal brain quality assessment operate at the \textit{slice} level, and fail to get a comprehensive picture of the quality of an image, that can only be achieved by looking at the \textit{entire} brain volume. In this work, we propose FetMRQC, a machine learning framework for automated image quality assessment tailored to fetal brain MRI, which extracts an ensemble of quality metrics that are then used to predict experts' ratings. Based on the manual ratings of more than 1000 low-resolution stacks acquired across two different institutions, we show that, compared with existing quality metrics, FetMRQC is able to generalize out-of-domain, while being interpretable and data efficient. We also release a novel manual quality rating tool designed to facilitate and optimize quality rating of fetal brain images. Our tool, along with all the code to generate, train and evaluate the model is available at https://github.com/Medical-Image-Analysis-Laboratory/fetal_brain_qc/ .
    Constrained Adaptive Attacks: Realistic Evaluation of Adversarial Examples and Robust Training of Deep Neural Networks for Tabular Data. (arXiv:2311.04503v1 [cs.LG])
    State-of-the-art deep learning models for tabular data have recently achieved acceptable performance to be deployed in industrial settings. However, the robustness of these models remains scarcely explored. Contrary to computer vision, there is to date no realistic protocol to properly evaluate the adversarial robustness of deep tabular models due to intrinsic properties of tabular data such as categorical features, immutability, and feature relationship constraints. To fill this gap, we propose CAA, the first efficient evasion attack for constrained tabular deep learning models. CAA is an iterative parameter-free attack that combines gradient and search attacks to generate adversarial examples under constraints. We leverage CAA to build a benchmark of deep tabular models across three popular use cases: credit scoring, phishing and botnet attacks detection. Our benchmark supports ten threat models with increasing capabilities of the attacker, and reflects real-world attack scenarios for each use case. Overall, our results demonstrate how domain knowledge, adversarial training, and attack budgets impact the robustness assessment of deep tabular models and provide security practitioners with a set of recommendations to improve the robustness of deep tabular models against various evasion attack scenarios.
    When to Update Your Model: Constrained Model-based Reinforcement Learning. (arXiv:2210.08349v4 [cs.LG] UPDATED)
    Designing and analyzing model-based RL (MBRL) algorithms with guaranteed monotonic improvement has been challenging, mainly due to the interdependence between policy optimization and model learning. Existing discrepancy bounds generally ignore the impacts of model shifts, and their corresponding algorithms are prone to degrade performance by drastic model updating. In this work, we first propose a novel and general theoretical scheme for a non-decreasing performance guarantee of MBRL. Our follow-up derived bounds reveal the relationship between model shifts and performance improvement. These discoveries encourage us to formulate a constrained lower-bound optimization problem to permit the monotonicity of MBRL. A further example demonstrates that learning models from a dynamically-varying number of explorations benefit the eventual returns. Motivated by these analyses, we design a simple but effective algorithm CMLO (Constrained Model-shift Lower-bound Optimization), by introducing an event-triggered mechanism that flexibly determines when to update the model. Experiments show that CMLO surpasses other state-of-the-art methods and produces a boost when various policy optimization methods are employed.
    Efficient and Equivariant Graph Networks for Predicting Quantum Hamiltonian. (arXiv:2306.04922v2 [cs.LG] UPDATED)
    We consider the prediction of the Hamiltonian matrix, which finds use in quantum chemistry and condensed matter physics. Efficiency and equivariance are two important, but conflicting factors. In this work, we propose a SE(3)-equivariant network, named QHNet, that achieves efficiency and equivariance. Our key advance lies at the innovative design of QHNet architecture, which not only obeys the underlying symmetries, but also enables the reduction of number of tensor products by 92\%. In addition, QHNet prevents the exponential growth of channel dimension when more atom types are involved. We perform experiments on MD17 datasets, including four molecular systems. Experimental results show that our QHNet can achieve comparable performance to the state of the art methods at a significantly faster speed. Besides, our QHNet consumes 50\% less memory due to its streamlined architecture. Our code is publicly available as part of the AIRS library (\url{https://github.com/divelab/AIRS}).
    The voraus-AD Dataset for Anomaly Detection in Robot Applications. (arXiv:2311.04765v1 [cs.RO])
    During the operation of industrial robots, unusual events may endanger the safety of humans and the quality of production. When collecting data to detect such cases, it is not ensured that data from all potentially occurring errors is included as unforeseeable events may happen over time. Therefore, anomaly detection (AD) delivers a practical solution, using only normal data to learn to detect unusual events. We introduce a dataset that allows training and benchmarking of anomaly detection methods for robotic applications based on machine data which will be made publicly available to the research community. As a typical robot task the dataset includes a pick-and-place application which involves movement, actions of the end effector and interactions with the objects of the environment. Since several of the contained anomalies are not task-specific but general, evaluations on our dataset are transferable to other robotics applications as well. Additionally, we present MVT-Flow (multivariate time-series flow) as a new baseline method for anomaly detection: It relies on deep-learning-based density estimation with normalizing flows, tailored to the data domain by taking its structure into account for the architecture. Our evaluation shows that MVT-Flow outperforms baselines from previous work by a large margin of 6.2% in area under ROC.
    Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding. (arXiv:2209.05629v2 [cs.RO] UPDATED)
    Abstract semantic 3D scene understanding is a problem of critical importance in robotics. As robots still lack the common-sense knowledge about household objects and locations of an average human, we investigate the use of pre-trained language models to impart common sense for scene understanding. We introduce and compare a wide range of scene classification paradigms that leverage language only (zero-shot, embedding-based, and structured-language) or vision and language (zero-shot and fine-tuned). We find that the best approaches in both categories yield $\sim 70\%$ room classification accuracy, exceeding the performance of pure-vision and graph classifiers. We also find such methods demonstrate notable generalization and transfer capabilities stemming from their use of language.
    Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning. (arXiv:2309.06597v2 [cs.CV] UPDATED)
    The widespread adoption of commercial autonomous vehicles (AVs) and advanced driver assistance systems (ADAS) may largely depend on their acceptance by society, for which their perceived trustworthiness and interpretability to riders are crucial. In general, this task is challenging because modern autonomous systems software relies heavily on black-box artificial intelligence models. Towards this goal, this paper introduces a novel dataset, Rank2Tell, a multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance. Using various close and open-ended visual question answering, the dataset provides dense annotations of various semantic, spatial, temporal, and relational attributes of various important objects in complex traffic scenarios. The dense annotations and unique attributes of the dataset make it a valuable resource for researchers working on visual scene understanding and related fields. Furthermore, we introduce a joint model for joint importance level ranking and natural language captions generation to benchmark our dataset and demonstrate performance with quantitative evaluations.
    Riemannian Laplace Approximation with the Fisher Metric. (arXiv:2311.02766v2 [cs.LG] UPDATED)
    The Laplace's method approximates a target density with a Gaussian distribution at its mode. It is computationally efficient and asymptotically exact for Bayesian inference due to the Bernstein-von Mises theorem, but for complex targets and finite-data posteriors it is often too crude an approximation. A recent generalization of the Laplace Approximation transforms the Gaussian approximation according to a chosen Riemannian geometry providing a richer approximation family, while still retaining computational efficiency. However, as shown here, its properties heavily depend on the chosen metric, indeed the metric adopted in previous work results in approximations that are overly narrow as well as being biased even at the limit of infinite data. We correct this shortcoming by developing the approximation family further, deriving two alternative variants that are exact at the limit of infinite data, extending the theoretical analysis of the method, and demonstrating practical improvements in a range of experiments.
    Versatile Energy-Based Probabilistic Models for High Energy Physics. (arXiv:2302.00695v4 [cs.LG] UPDATED)
    As a classical generative modeling approach, energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In line with these advancements, we build a multi-purpose energy-based probabilistic model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. As for applicational aspects, it can serve as a powerful parameterized event generator for physics simulation, a generic anomalous signal detector free from spurious correlations, and an augmented event classifier for particle identification.
    Multi-Source Domain Adaptation through Dataset Dictionary Learning in Wasserstein Space. (arXiv:2307.14953v3 [cs.LG] UPDATED)
    This paper seeks to solve Multi-Source Domain Adaptation (MSDA), which aims to mitigate data distribution shifts when transferring knowledge from multiple labeled source domains to an unlabeled target domain. We propose a novel MSDA framework based on dictionary learning and optimal transport. We interpret each domain in MSDA as an empirical distribution. As such, we express each domain as a Wasserstein barycenter of dictionary atoms, which are empirical distributions. We propose a novel algorithm, DaDiL, for learning via mini-batches: (i) atom distributions; (ii) a matrix of barycentric coordinates. Based on our dictionary, we propose two novel methods for MSDA: DaDil-R, based on the reconstruction of labeled samples in the target domain, and DaDiL-E, based on the ensembling of classifiers learned on atom distributions. We evaluate our methods in 3 benchmarks: Caltech-Office, Office 31, and CRWU, where we improved previous state-of-the-art by 3.15%, 2.29%, and 7.71% in classification performance. Finally, we show that interpolations in the Wasserstein hull of learned atoms provide data that can generalize to the target domain.
    Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs. (arXiv:2311.04417v1 [cs.AR])
    The relentless advancement of artificial intelligence (AI) and machine learning (ML) applications necessitates the development of specialized hardware accelerators capable of handling the increasing complexity and computational demands. Traditional computing architectures, based on the von Neumann model, are being outstripped by the requirements of contemporary AI/ML algorithms, leading to a surge in the creation of accelerators like the Graphcore Intelligence Processing Unit (IPU), Sambanova Reconfigurable Dataflow Unit (RDU), and enhanced GPU platforms. These hardware accelerators are characterized by their innovative data-flow architectures and other design optimizations that promise to deliver superior performance and energy efficiency for AI/ML tasks. This research provides a preliminary evaluation and comparison of these commercial AI/ML accelerators, delving into their hardware and software design features to discern their strengths and unique capabilities. By conducting a series of benchmark evaluations on common DNN operators and other AI/ML workloads, we aim to illuminate the advantages of data-flow architectures over conventional processor designs and offer insights into the performance trade-offs of each platform. The findings from our study will serve as a valuable reference for the design and performance expectations of research prototypes, thereby facilitating the development of next-generation hardware accelerators tailored for the ever-evolving landscape of AI/ML applications. Through this analysis, we aspire to contribute to the broader understanding of current accelerator technologies and to provide guidance for future innovations in the field.
    Bandit Learning to Rank with Position-Based Click Models: Personalized and Equal Treatments. (arXiv:2311.04528v1 [cs.LG])
    Online learning to rank (ONL2R) is a foundational problem for recommender systems and has received increasing attention in recent years. Among the existing approaches for ONL2R, a natural modeling architecture is the multi-armed bandit framework coupled with the position-based click model. However, developing efficient online learning policies for MAB-based ONL2R with position-based click models is highly challenging due to the combinatorial nature of the problem, and partial observability in the position-based click model. To date, results in MAB-based ONL2R with position-based click models remain rather limited, which motivates us to fill this gap in this work. Our main contributions in this work are threefold: i) We propose the first general MAB framework that captures all key ingredients of ONL2R with position-based click models. Our model considers personalized and equal treatments in ONL2R ranking recommendations, both of which are widely used in practice; ii) Based on the above analytical framework, we develop two unified greed- and UCB-based policies called GreedyRank and UCBRank, each of which can be applied to personalized and equal ranking treatments; and iii) We show that both GreedyRank and UCBRank enjoy $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regret for personalized and equal treatment, respectively. For the fundamentally hard equal ranking treatment, we identify classes of collective utility functions and their associated sufficient conditions under which $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regrets are still achievable for GreedyRank and UCBRank, respectively. Our numerical experiments also verify our theoretical results and demonstrate the efficiency of GreedyRank and UCBRank in seeking the optimal action under various problem settings.
    LuminanceL1Loss: A loss function which measures percieved brightness and colour differences. (arXiv:2311.04614v1 [cs.CV])
    We introduce LuminanceL1Loss, a novel loss function designed to enhance the performance of image restoration tasks. We demonstrate its superiority over MSE when applied to the Retinexformer, BUIFD and DnCNN architectures. Our proposed LuminanceL1Loss leverages a unique approach by transforming images into grayscale and subsequently computing the MSE loss for both grayscale and color channels. Experimental results demonstrate that this innovative loss function consistently outperforms traditional methods, showcasing its potential in image denoising and other related tasks in image reconstruction. It demonstrates gains up to 4.7dB. The results presented in this study highlight the efficacy of LuminanceL1Loss for various image restoration tasks.
    Sample Efficient Reinforcement Learning in Mixed Systems through Augmented Samples and Its Applications to Queueing Networks. (arXiv:2305.16483v2 [cs.LG] UPDATED)
    This paper considers a class of reinforcement learning problems, which involve systems with two types of states: stochastic and pseudo-stochastic. In such systems, stochastic states follow a stochastic transition kernel while the transitions of pseudo-stochastic states are deterministic given the stochastic states/transitions. We refer to such systems as mixed systems, which are widely used in various applications, including manufacturing systems, communication networks, and queueing networks. We propose a sample efficient RL method that accelerates learning by generating augmented data samples. The proposed algorithm is data-driven and learns the policy from data samples from both real and augmented samples. This method significantly improves learning by reducing the sample complexity such that the dataset only needs to have sufficient coverage of the stochastic states. We analyze the sample complexity of the proposed method under Fitted Q Iteration (FQI) and demonstrate that the optimality gap decreases as $\tilde{\mathcal{O}}(\sqrt{{1}/{n}}+\sqrt{{1}/{m}}),$ where $n$ is the number of real samples and $m$ is the number of augmented samples per real sample. It is important to note that without augmented samples, the optimality gap is $\tilde{\mathcal{O}}(1)$ due to insufficient data coverage of the pseudo-stochastic states. Our experimental results on multiple queueing network applications confirm that the proposed method indeed significantly accelerates learning in both deep Q-learning and deep policy gradient.
    Blind Federated Learning via Over-the-Air q-QAM. (arXiv:2311.04253v1 [eess.SP])
    In this work, we investigate federated edge learning over a fading multiple access channel. To alleviate the communication burden between the edge devices and the access point, we introduce a pioneering digital over-the-air computation strategy employing q-ary quadrature amplitude modulation, culminating in a low latency communication scheme. Indeed, we propose a new federated edge learning framework in which edge devices use digital modulation for over-the-air uplink transmission to the edge server while they have no access to the channel state information. Furthermore, we incorporate multiple antennas at the edge server to overcome the fading inherent in wireless communication. We analyze the number of antennas required to mitigate the fading impact effectively. We prove a non-asymptotic upper bound for the mean squared error for the proposed federated learning with digital over-the-air uplink transmissions under both noisy and fading conditions. Leveraging the derived upper bound, we characterize the convergence rate of the learning process of a non-convex loss function in terms of the mean square error of gradients due to the fading channel. Furthermore, we substantiate the theoretical assurances through numerical experiments concerning mean square error and the convergence efficacy of the digital federated edge learning framework. Notably, the results demonstrate that augmenting the number of antennas at the edge server and adopting higher-order modulations improve the model accuracy up to 60\%.
    Enhancing Multi-Agent Coordination through Common Operating Picture Integration. (arXiv:2311.04740v1 [cs.MA])
    In multi-agent systems, agents possess only local observations of the environment. Communication between teammates becomes crucial for enhancing coordination. Past research has primarily focused on encoding local information into embedding messages which are unintelligible to humans. We find that using these messages in agent's policy learning leads to brittle policies when tested on out-of-distribution initial states. We present an approach to multi-agent coordination, where each agent is equipped with the capability to integrate its (history of) observations, actions and messages received into a Common Operating Picture (COP) and disseminate the COP. This process takes into account the dynamic nature of the environment and the shared mission. We conducted experiments in the StarCraft2 environment to validate our approach. Our results demonstrate the efficacy of COP integration, and show that COP-based training leads to robust policies compared to state-of-the-art Multi-Agent Reinforcement Learning (MARL) methods when faced with out-of-distribution initial states.
    Question Answering for Electronic Health Records: A Scoping Review of datasets and models. (arXiv:2310.08759v2 [cs.LG] UPDATED)
    Question Answering (QA) systems on patient-related data can assist both clinicians and patients. They can, for example, assist clinicians in decision-making and enable patients to have a better understanding of their medical history. Significant amounts of patient data are stored in Electronic Health Records (EHRs), making EHR QA an important research area. In EHR QA, the answer is obtained from the medical record of the patient. Because of the differences in data format and modality, this differs greatly from other medical QA tasks that employ medical websites or scientific papers to retrieve answers, making it critical to research EHR question answering. This study aimed to provide a methodological review of existing works on QA over EHRs. We searched for articles from January 1st, 2005 to September 30th, 2023 in four digital sources including Google Scholar, ACL Anthology, ACM Digital Library, and PubMed to collect relevant publications on EHR QA. 4111 papers were identified for our study, and after screening based on our inclusion criteria, we obtained a total of 47 papers for further study. Out of the 47 papers, 25 papers were about EHR QA datasets, and 37 papers were about EHR QA models. It was observed that QA on EHRs is relatively new and unexplored. Most of the works are fairly recent. Also, it was observed that emrQA is by far the most popular EHR QA dataset, both in terms of citations and usage in other papers. Furthermore, we identified the different models used in EHR QA along with the evaluation metrics used for these models.
    Enhancing Malware Detection by Integrating Machine Learning with Cuckoo Sandbox. (arXiv:2311.04372v1 [cs.CR])
    In the modern era, malware is experiencing a significant increase in both its variety and quantity, aligning with the widespread adoption of the digital world. This surge in malware has emerged as a critical challenge in the realm of cybersecurity, prompting numerous research endeavors and contributions to address the issue. Machine learning algorithms have been leveraged for malware detection due to their ability to uncover concealed patterns within vast datasets. However, deep learning algorithms, characterized by their multi-layered structure, surpass the limitations of traditional machine learning approaches. By employing deep learning techniques such as CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network), this study aims to classify and identify malware extracted from a dataset containing API call sequences. The performance of these algorithms is compared with that of conventional machine learning methods, including SVM (Support Vector Machine), RF (Random Forest), KNN (K-Nearest Neighbors), XGB (Extreme Gradient Boosting), and GBC (Gradient Boosting Classifier), all using the same dataset. The outcomes of this research demonstrate that both deep learning and machine learning algorithms achieve remarkably high levels of accuracy, reaching up to 99% in certain cases.
    Implementation of Trained Factorization Machine Recommendation System on Quantum Annealer. (arXiv:2210.12953v2 [quant-ph] UPDATED)
    Factorization Machine (FM) is the most commonly used model to build a recommendation system since it can incorporate side information to improve performance. However, producing item suggestions for a given user with a trained FM is time-consuming. It requires a run-time of $O((N_m \log N_m)^2)$, where $N_m$ is the number of items in the dataset. To address this problem, we propose a quadratic unconstrained binary optimization (QUBO) scheme to combine with FM and apply quantum annealing (QA) computation. Compared to classical methods, this hybrid algorithm provides a faster than quadratic speedup in finding good user suggestions. We then demonstrate the aforementioned computational advantage on current NISQ hardware by experimenting with a real example on a D-Wave annealer.
    Hierarchical clustering with dot products recovers hidden tree structure. (arXiv:2305.15022v2 [stat.ML] UPDATED)
    In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
    Improving Fairness in Deepfake Detection. (arXiv:2306.16635v3 [cs.CV] UPDATED)
    Despite the development of effective deepfake detectors in recent years, recent studies have demonstrated that biases in the data used to train these detectors can lead to disparities in detection accuracy across different races and genders. This can result in different groups being unfairly targeted or excluded from detection, allowing undetected deepfakes to manipulate public opinion and erode trust in a deepfake detection model. While existing studies have focused on evaluating fairness of deepfake detectors, to the best of our knowledge, no method has been developed to encourage fairness in deepfake detection at the algorithm level. In this work, we make the first attempt to improve deepfake detection fairness by proposing novel loss functions that handle both the setting where demographic information (eg, annotations of race and gender) is available as well as the case where this information is absent. Fundamentally, both approaches can be used to convert many existing deepfake detectors into ones that encourages fairness. Extensive experiments on four deepfake datasets and five deepfake detectors demonstrate the effectiveness and flexibility of our approach in improving deepfake detection fairness. Our code is available at https://github.com/littlejuyan/DF_Fairness.
    Kindness in Multi-Agent Reinforcement Learning. (arXiv:2311.04239v1 [cs.AI])
    In human societies, people often incorporate fairness in their decisions and treat reciprocally by being kind to those who act kindly. They evaluate the kindness of others' actions not only by monitoring the outcomes but also by considering the intentions. This behavioral concept can be adapted to train cooperative agents in Multi-Agent Reinforcement Learning (MARL). We propose the KindMARL method, where agents' intentions are measured by counterfactual reasoning over the environmental impact of the actions that were available to the agents. More specifically, the current environment state is compared with the estimation of the current environment state provided that the agent had chosen another action. The difference between each agent's reward, as the outcome of its action, with that of its fellow, multiplied by the intention of the fellow is then taken as the fellow's "kindness". If the result of each reward-comparison confirms the agent's superiority, it perceives the fellow's kindness and reduces its own reward. Experimental results in the Cleanup and Harvest environments show that training based on the KindMARL method enabled the agents to earn 89\% (resp. 37\%) and 44% (resp. 43\%) more total rewards than training based on the Inequity Aversion and Social Influence methods. The effectiveness of KindMARL is further supported by experiments in a traffic light control problem.
    Physics informed machine learning with Smoothed Particle Hydrodynamics: Hierarchy of reduced Lagrangian models of turbulence. (arXiv:2110.13311v7 [physics.flu-dyn] UPDATED)
    Building efficient, accurate and generalizable reduced order models of developed turbulence remains a major challenge. This manuscript approaches this problem by developing a hierarchy of parameterized reduced Lagrangian models for turbulent flows, and investigates the effects of enforcing physical structure through Smoothed Particle Hydrodynamics (SPH) versus relying on neural networks (NN)s as universal function approximators. Starting from Neural Network (NN) parameterizations of a Lagrangian acceleration operator, this hierarchy of models gradually incorporates a weakly compressible and parameterized SPH framework, which enforces physical symmetries, such as Galilean, rotational and translational invariances. Within this hierarchy, two new parameterized smoothing kernels are developed in order to increase the flexibility of the learn-able SPH simulators. For each model we experiment with different loss functions which are minimized using gradient based optimization, where efficient computations of gradients are obtained by using Automatic Differentiation (AD) and Sensitivity Analysis (SA). Each model within the hierarchy is trained on two data sets associated with weekly compressible Homogeneous Isotropic Turbulence (HIT): (1) a validation set using weakly compressible SPH; and (2) a high fidelity set from Direct Numerical Simulations (DNS). Numerical evidence shows that encoding more SPH structure improves generalizability to different turbulent Mach numbers and time shifts, and that including the novel parameterized smoothing kernels improves the accuracy of SPH at the resolved scales.
    Implicit models, latent compression, intrinsic biases, and cheap lunches in community detection. (arXiv:2210.09186v7 [cs.SI] UPDATED)
    The task of community detection, which aims to partition a network into clusters of nodes to summarize its large-scale structure, has spawned the development of many competing algorithms with varying objectives. Some community detection methods are inferential, explicitly deriving the clustering objective through a probabilistic generative model, while other methods are descriptive, dividing a network according to an objective motivated by a particular application, making it challenging to compare these methods on the same scale. Here we present a solution to this problem that associates any community detection objective, inferential or descriptive, with its corresponding implicit network generative model. This allows us to compute the description length of a network and its partition under arbitrary objectives, providing a principled measure to compare the performance of different algorithms without the need for "ground truth" labels. Our approach also gives access to instances of the community detection problem that are optimal to any given algorithm, and in this way reveals intrinsic biases in popular descriptive methods, explaining their tendency to overfit. Using our framework, we compare a number of community detection methods on artificial networks, and on a corpus of over 500 structurally diverse empirical networks. We find that more expressive community detection methods exhibit consistently superior compression performance on structured data instances, without having degraded performance on a minority of situations where more specialized algorithms perform optimally. Our results undermine the implications of the "no free lunch" theorem for community detection, both conceptually and in practice, since it is confined to unstructured data instances, unlike relevant community detection problems which are structured by requirement.
    PTW: Pivotal Tuning Watermarking for Pre-Trained Image Generators. (arXiv:2304.07361v3 [cs.LG] UPDATED)
    Deepfakes refer to content synthesized using deep generators, which, when misused, have the potential to erode trust in digital media. Synthesizing high-quality deepfakes requires access to large and complex generators only a few entities can train and provide. The threat is malicious users that exploit access to the provided model and generate harmful deepfakes without risking detection. Watermarking makes deepfakes detectable by embedding an identifiable code into the generator that is later extractable from its generated images. We propose Pivotal Tuning Watermarking (PTW), a method for watermarking pre-trained generators (i) three orders of magnitude faster than watermarking from scratch and (ii) without the need for any training data. We improve existing watermarking methods and scale to generators $4 \times$ larger than related work. PTW can embed longer codes than existing methods while better preserving the generator's image quality. We propose rigorous, game-based definitions for robustness and undetectability, and our study reveals that watermarking is not robust against an adaptive white-box attacker who controls the generator's parameters. We propose an adaptive attack that can successfully remove any watermarking with access to only 200 non-watermarked images. Our work challenges the trustworthiness of watermarking for deepfake detection when the parameters of a generator are available. The source code to reproduce our experiments is available at https://github.com/nilslukas/gan-watermark.
    FEIR: Quantifying and Reducing Envy and Inferiority for Fair Recommendation of Limited Resources. (arXiv:2311.04542v1 [cs.IR])
    In settings such as e-recruitment and online dating, recommendation involves distributing limited opportunities, calling for novel approaches to quantify and enforce fairness. We introduce \emph{inferiority}, a novel (un)fairness measure quantifying a user's competitive disadvantage for their recommended items. Inferiority complements \emph{envy}, a fairness notion measuring preference for others' recommendations. We combine inferiority and envy with \emph{utility}, an accuracy-related measure of aggregated relevancy scores. Since these measures are non-differentiable, we reformulate them using a probabilistic interpretation of recommender systems, yielding differentiable versions. We combine these loss functions in a multi-objective optimization problem called \texttt{FEIR} (Fairness through Envy and Inferiority Reduction), applied as post-processing for standard recommender systems. Experiments on synthetic and real-world data demonstrate that our approach improves trade-offs between inferiority, envy, and utility compared to naive recommendations and the baseline methods.
    DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets. (arXiv:2311.04894v1 [cs.CV])
    Construction of a universal detector poses a crucial question: How can we most effectively train a model on a large mixture of datasets? The answer lies in learning dataset-specific features and ensembling their knowledge but do all this in a single model. Previous methods achieve this by having separate detection heads on a common backbone but that results in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose Dataset-Aware Mixture-of-Experts, DAMEX where we train the experts to become an `expert' of a dataset by learning to route each dataset tokens to its mapped expert. Experiments on Universal Object-Detection Benchmark show that we outperform the existing state-of-the-art by average +10.2 AP score and improve over our non-MoE baseline by average +2.0 AP score. We also observe consistent gains while mixing datasets with (1) limited availability, (2) disparate domains and (3) divergent label sets. Further, we qualitatively show that DAMEX is robust against expert representation collapse.
    Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments. (arXiv:2309.06183v2 [eess.AS] UPDATED)
    The acoustic variability of noisy and reverberant speech mixtures is influenced by multiple factors, such as the spectro-temporal characteristics of the target speaker and the interfering noise, the signal-to-noise ratio (SNR) and the room characteristics. This large variability poses a major challenge for learning-based speech enhancement systems, since a mismatch between the training and testing conditions can substantially reduce the performance of the system. Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or binaural room impulse response (BRIR) database different from the one used during training. However, the difficulty of the speech enhancement task can change across databases, which can substantially influence the results. The present study introduces a generalization assessment framework that uses a reference model trained on the test condition, such that it can be used as a proxy for the difficulty of the test condition. This allows to disentangle the effect of the change in task difficulty from the effect of dealing with new data, and thus to define a new measure of generalization performance termed the generalization gap. The procedure is repeated in a cross-validation fashion by cycling through multiple speech, noise, and BRIR databases to accurately estimate the generalization gap. The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), Conv-TasNet, DCCRN and MANNER. We find that for all models, the performance degrades the most in speech mismatches, while good noise and room generalization can be achieved by training on multiple databases. Moreover, while recent models show higher performance in matched conditions, their performance substantially decreases in mismatched conditions and can become inferior to that of the FFNN-based system.
    Byzantine-Tolerant Methods for Distributed Variational Inequalities. (arXiv:2311.04611v1 [cs.LG])
    Robustness to Byzantine attacks is a necessity for various distributed training scenarios. When the training reduces to the process of solving a minimization problem, Byzantine robustness is relatively well-understood. However, other problem formulations, such as min-max problems or, more generally, variational inequalities, arise in many modern machine learning and, in particular, distributed learning tasks. These problems significantly differ from the standard minimization ones and, therefore, require separate consideration. Nevertheless, only one work (Adibi et al., 2022) addresses this important question in the context of Byzantine robustness. Our work makes a further step in this direction by providing several (provably) Byzantine-robust methods for distributed variational inequality, thoroughly studying their theoretical convergence, removing the limitations of the previous work, and providing numerical comparisons supporting the theoretical findings.
    Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold. (arXiv:2303.08269v2 [cs.LG] UPDATED)
    Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification where the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the selected completely at random (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed), leading to a poor estimate of the proportion, $\alpha$, of positives among unlabeled examples and poor model calibration, resulting in an uncertain decision threshold for selecting positives. PU learning algorithms can estimate $\alpha$ or the probability of an individual unlabeled instance being positive or both. We propose two PU learning algorithms to estimate $\alpha$, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR uses a divide-and-conquer approach that creates and solves several SCAR-like sub-problems using PULSCAR. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.
    GPT-ST: Generative Pre-Training of Spatio-Temporal Graph Neural Networks. (arXiv:2311.04245v1 [cs.LG])
    In recent years, there has been a rapid development of spatio-temporal prediction techniques in response to the increasing demands of traffic management and travel planning. While advanced end-to-end models have achieved notable success in improving predictive performance, their integration and expansion pose significant challenges. This work aims to address these challenges by introducing a spatio-temporal pre-training framework that seamlessly integrates with downstream baselines and enhances their performance. The framework is built upon two key designs: (i) We propose a spatio-temporal mask autoencoder as a pre-training model for learning spatio-temporal dependencies. The model incorporates customized parameter learners and hierarchical spatial pattern encoding networks. These modules are specifically designed to capture spatio-temporal customized representations and intra- and inter-cluster region semantic relationships, which have often been neglected in existing approaches. (ii) We introduce an adaptive mask strategy as part of the pre-training mechanism. This strategy guides the mask autoencoder in learning robust spatio-temporal representations and facilitates the modeling of different relationships, ranging from intra-cluster to inter-cluster, in an easy-to-hard training manner. Extensive experiments conducted on representative benchmarks demonstrate the effectiveness of our proposed method. We have made our model implementation publicly available at https://github.com/HKUDS/GPT-ST.
    Designing Robust Transformers using Robust Kernel Density Estimation. (arXiv:2210.05794v3 [cs.LG] UPDATED)
    Recent advances in Transformer architectures have empowered their empirical success in a variety of tasks across different domains. However, existing works mainly focus on predictive accuracy and computational cost, without considering other practical issues, such as robustness to contaminated samples. Recent work by Nguyen et al., (2022) has shown that the self-attention mechanism, which is the center of the Transformer architecture, can be viewed as a non-parametric estimator based on kernel density estimation (KDE). This motivates us to leverage a set of robust kernel density estimation methods for alleviating the issue of data contamination. Specifically, we introduce a series of self-attention mechanisms that can be incorporated into different Transformer architectures and discuss the special properties of each method. We then perform extensive empirical studies on language modeling and image classification tasks. Our methods demonstrate robust performance in multiple scenarios while maintaining competitive results on clean datasets.
    Environmental-Impact Based Multi-Agent Reinforcement Learning. (arXiv:2311.04240v1 [cs.AI])
    To promote cooperation and strengthen the individual impact on the collective outcome in social dilemmas, we propose the Environmental-impact Multi-Agent Reinforcement Learning (EMuReL) method where each agent estimates the "environmental impact" of every other agent, that is, the difference in the current environment state compared to the hypothetical environment in the absence of that other agent. Inspired by the Inequity Aversion model, the agent then compares its own reward with those of its fellows multiplied by their environmental impacts. If its reward exceeds the scaled reward of one of its fellows, the agent takes "social responsibility" toward that fellow by reducing its own reward. Therefore, the less influential an agent is in reaching the current state, the more social responsibility is taken by other agents. Experiments in the Cleanup (resp. Harvest) test environment demonstrate that agents trained based on EMuReL learn to cooperate more effectively and obtain $54\%$ ($39\%$) and $20\%$ ($44\%$) more total rewards while preserving the same cooperation levels compared to when they are trained based on the two state-of-the-art reward reshaping methods inequity aversion and social influence.
    Why Do Clinical Probabilistic Models Fail To Transport Between Sites?. (arXiv:2311.04787v1 [cs.LG])
    The rising popularity of artificial intelligence in healthcare is highlighting the problem that a computational model achieving super-human clinical performance at its training sites may perform substantially worse at new sites. In this perspective, we present common sources for this failure to transport, which we divide into sources under the control of the experimenter and sources inherent to the clinical data-generating process. Of the inherent sources we look a little deeper into site-specific clinical practices that can affect the data distribution, and propose a potential solution intended to isolate the imprint of those practices on the data from the patterns of disease cause and effect that are the usual target of clinical models.
    Bridging Dimensions: Confident Reachability for High-Dimensional Controllers. (arXiv:2311.04843v1 [cs.LG])
    Autonomous systems are increasingly implemented using end-end-end trained controllers. Such controllers make decisions that are executed on the real system with images as one of the primary sensing modalities. Deep neural networks form a fundamental building block of such controllers. Unfortunately, the existing neural-network verification tools do not scale to inputs with thousands of dimensions. Especially when the individual inputs (such as pixels) are devoid of clear physical meaning. This paper takes a step towards connecting exhaustive closed-loop verification with high-dimensional controllers. Our key insight is that the behavior of a high-dimensional controller can be approximated with several low-dimensional controllers in different regions of the state space. To balance approximation and verifiability, we leverage the latest verification-aware knowledge distillation. Then, if low-dimensional reachability results are inflated with statistical approximation errors, they yield a high-confidence reachability guarantee for the high-dimensional controller. We investigate two inflation techniques -- based on trajectories and actions -- both of which show convincing performance in two OpenAI gym benchmarks.
    Twitter Sentiment Analysis of Covid Vacciness. (arXiv:2311.04479v1 [cs.CL])
    In this paper, we look at a database of tweets sorted by various keywords that could indicate the users sentiment towards covid vaccines. With social media becoming such a prevalent source of opinion, sorting and ranking tweets that hold important information such as opinions on covid vaccines is of utmost importance. Two different ranking scales were used, and ranking a tweet in this way could represent the difference between an opinion being lost and an opinion being featured on the site, which affects the decisions and behavior of people, and why researchers were interested in it. Using natural language processing techniques, our aim is to determine and categorize opinions about covid vaccines with the highest accuracy possible.
    Robust and Communication-Efficient Federated Domain Adaptation via Random Features. (arXiv:2311.04686v1 [cs.LG])
    Modern machine learning (ML) models have grown to a scale where training them on a single machine becomes impractical. As a result, there is a growing trend to leverage federated learning (FL) techniques to train large ML models in a distributed and collaborative manner. These models, however, when deployed on new devices, might struggle to generalize well due to domain shifts. In this context, federated domain adaptation (FDA) emerges as a powerful approach to address this challenge. Most existing FDA approaches typically focus on aligning the distributions between source and target domains by minimizing their (e.g., MMD) distance. Such strategies, however, inevitably introduce high communication overheads and can be highly sensitive to network reliability. In this paper, we introduce RF-TCA, an enhancement to the standard Transfer Component Analysis approach that significantly accelerates computation without compromising theoretical and empirical performance. Leveraging the computational advantage of RF-TCA, we further extend it to FDA setting with FedRF-TCA. The proposed FedRF-TCA protocol boasts communication complexity that is \emph{independent} of the sample size, while maintaining performance that is either comparable to or even surpasses state-of-the-art FDA methods. We present extensive experiments to showcase the superior performance and robustness (to network condition) of FedRF-TCA.
    Certified Data Removal from Machine Learning Models. (arXiv:1911.03030v6 [cs.LG] UPDATED)
    Good data stewardship requires removal of data at the request of the data's owner. This raises the question if and how a trained machine-learning model, which implicitly stores information about its training data, should be affected by such a removal request. Is it possible to "remove" data from a machine-learning model? We study this problem by defining certified removal: a very strong theoretical guarantee that a model from which data is removed cannot be distinguished from a model that never observed the data to begin with. We develop a certified-removal mechanism for linear classifiers and empirically study learning settings in which this mechanism is practical.
    MixTEA: Semi-supervised Entity Alignment with Mixture Teaching. (arXiv:2311.04441v1 [cs.LG])
    Semi-supervised entity alignment (EA) is a practical and challenging task because of the lack of adequate labeled mappings as training data. Most works address this problem by generating pseudo mappings for unlabeled entities. However, they either suffer from the erroneous (noisy) pseudo mappings or largely ignore the uncertainty of pseudo mappings. In this paper, we propose a novel semi-supervised EA method, termed as MixTEA, which guides the model learning with an end-to-end mixture teaching of manually labeled mappings and probabilistic pseudo mappings. We firstly train a student model using few labeled mappings as standard. More importantly, in pseudo mapping learning, we propose a bi-directional voting (BDV) strategy that fuses the alignment decisions in different directions to estimate the uncertainty via the joint matching confidence score. Meanwhile, we also design a matching diversity-based rectification (MDR) module to adjust the pseudo mapping learning, thus reducing the negative influence of noisy mappings. Extensive results on benchmark datasets as well as further analyses demonstrate the superiority and the effectiveness of our proposed method.
    Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies. (arXiv:2302.01734v2 [cs.LG] UPDATED)
    Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed the development of their theoretical foundations. Despite the huge efforts directed at the design of efficient stochastic PG-type algorithms, the understanding of their convergence to a globally optimal policy is still limited. In this work, we develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies which allows to address the case of continuous state action spaces. First, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity of this method for finding a global $\varepsilon$-optimal policy. Improving over the previously known $\tilde{\mathcal{O}}(\varepsilon^{-3})$ complexity, this algorithm does not require the use of importance sampling or second-order information and samples only one trajectory per iteration. Second, we further improve this complexity to $\tilde{ \mathcal{\mathcal{O}} }(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient ((N)-HARPG) algorithm enhanced with a correction based on a Hessian-vector product. Interestingly, both algorithms are $(i)$ simple and easy to implement: single-loop, do not require large batches of trajectories and sample at most two trajectories per iteration; $(ii)$ computationally and memory efficient: they do not require expensive subroutines at each iteration and can be implemented with memory linear in the dimension of parameters.
    Fair Without Leveling Down: A New Intersectional Fairness Definition. (arXiv:2305.12495v2 [cs.LG] UPDATED)
    In this work, we consider the problem of intersectional group fairness in the classification setting, where the objective is to learn discrimination-free models in the presence of several intersecting sensitive groups. First, we illustrate various shortcomings of existing fairness measures commonly used to capture intersectional fairness. Then, we propose a new definition called the $\alpha$-Intersectional Fairness, which combines the absolute and the relative performance across sensitive groups and can be seen as a generalization of the notion of differential fairness. We highlight several desirable properties of the proposed definition and analyze its relation to other fairness measures. Finally, we benchmark multiple popular in-processing fair machine learning approaches using our new fairness definition and show that they do not achieve any improvement over a simple baseline. Our results reveal that the increase in fairness measured by previous definitions hides a "leveling down" effect, i.e., degrading the best performance over groups rather than improving the worst one.
    Can LLMs Follow Simple Rules?. (arXiv:2311.04235v1 [cs.AI])
    As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Evaluating how well LLMs follow developer-provided rules in the face of adversarial inputs typically requires manual review, which slows down monitoring and methods development. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 15 simple text scenarios in which the model is instructed to obey a set of rules in natural language while interacting with the human user. Each scenario has a concise evaluation program to determine whether the model has broken any rules in a conversation. Through manual exploration of model behavior in our scenarios, we identify 6 categories of attack strategies and collect two suites of test cases: one consisting of unique conversations from manual testing and one that systematically implements strategies from the 6 categories. Across various popular proprietary and open models such as GPT-4 and Llama 2, we find that all models are susceptible to a wide variety of adversarial hand-crafted user inputs, though GPT-4 is the best-performing model. Additionally, we evaluate open models under gradient-based attacks and find significant vulnerabilities. We propose RuLES as a challenging new setting for research into exploring and defending against both manual and automatic attacks on LLMs.
    Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression. (arXiv:2306.15063v2 [cs.LG] UPDATED)
    Pretrained transformers exhibit the remarkable ability of in-context learning (ICL): they can learn tasks from just a few examples provided in the prompt without updating any weights. This raises a foundational question: can ICL solve fundamentally $\textit{new}$ tasks that are very different from those seen during pretraining? To probe this question, we examine ICL's performance on linear regression while varying the diversity of tasks in the pretraining dataset. We empirically demonstrate a $\textit{task diversity threshold}$ for the emergence of ICL. Below this threshold, the pretrained transformer cannot solve unseen regression tasks, instead behaving like a Bayesian estimator with the $\textit{non-diverse pretraining task distribution}$ as the prior. Beyond this threshold, the transformer significantly outperforms this estimator; its behavior aligns with that of ridge regression, corresponding to a Gaussian prior over $\textit{all tasks}$, including those not seen during pretraining. Thus, when pretrained on data with task diversity greater than the threshold, transformers $\textit{can}$ optimally solve fundamentally new tasks in-context. Importantly, this capability hinges on it deviating from the Bayes optimal estimator with the pretraining distribution as the prior. This study also explores the effect of regularization, model capacity and task structure and underscores, in a concrete example, the critical role of task diversity, alongside data and model scale, in the emergence of ICL. Code is available at https://github.com/mansheej/icl-task-diversity.
    Conditional Sampling of Variational Autoencoders via Iterated Approximate Ancestral Sampling. (arXiv:2308.09078v2 [cs.LG] UPDATED)
    Conditional sampling of variational autoencoders (VAEs) is needed in various applications, such as missing data imputation, but is computationally intractable. A principled choice for asymptotically exact conditional sampling is Metropolis-within-Gibbs (MWG). However, we observe that the tendency of VAEs to learn a structured latent space, a commonly desired property, can cause the MWG sampler to get "stuck" far from the target distribution. This paper mitigates the limitations of MWG: we systematically outline the pitfalls in the context of VAEs, propose two original methods that address these pitfalls, and demonstrate an improved performance of the proposed methods on a set of sampling tasks.
    Physics-Informed Graph Convolutional Networks: Towards a generalized framework for complex geometries. (arXiv:2310.14948v3 [cs.LG] UPDATED)
    Since the seminal work of [9] and their Physics-Informed neural networks (PINNs), many efforts have been conducted towards solving partial differential equations (PDEs) with Deep Learning models. However, some challenges remain, for instance the extension of such models to complex three-dimensional geometries, and a study on how such approaches could be combined to classical numerical solvers. In this work, we justify the use of graph neural networks for these problems, based on the similarity between these architectures and the meshes used in traditional numerical techniques for solving partial differential equations. After proving an issue with the Physics-Informed framework for complex geometries, during the computation of PDE residuals, an alternative procedure is proposed, by combining classical numerical solvers and the Physics-Informed framework. Finally, we propose an implementation of this approach, that we test on a three-dimensional problem on an irregular geometry.
    Unifying Structure and Language Semantic for Efficient Contrastive Knowledge Graph Completion with Structured Entity Anchors. (arXiv:2311.04250v1 [cs.AI])
    The goal of knowledge graph completion (KGC) is to predict missing links in a KG using trained facts that are already known. In recent, pre-trained language model (PLM) based methods that utilize both textual and structural information are emerging, but their performances lag behind state-of-the-art (SOTA) structure-based methods or some methods lose their inductive inference capabilities in the process of fusing structure embedding to text encoder. In this paper, we propose a novel method to effectively unify structure information and language semantics without losing the power of inductive reasoning. We adopt entity anchors and these anchors and textual description of KG elements are fed together into the PLM-based encoder to learn unified representations. In addition, the proposed method utilizes additional random negative samples which can be reused in the each mini-batch during contrastive learning to learn a generalized entity representations. We verify the effectiveness of the our proposed method through various experiments and analysis. The experimental results on standard benchmark widely used in link prediction task show that the proposed model outperforms existing the SOTA KGC models. Especially, our method show the largest performance improvement on FB15K-237, which is competitive to the SOTA of structure-based KGC methods.
    Two Complementary Perspectives to Continual Learning: Ask Not Only What to Optimize, But Also How. (arXiv:2311.04898v1 [cs.LG])
    Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some continual learning work that alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as alternative to improving the optimization objective, while we argue it should be complementary. To evaluate the merits of our proposition, we plan to combine replay-approximated joint objectives with gradient projection-based optimization routines to test whether the addition of the latter provides benefits in terms of (1) alleviating the stability gap, (2) increasing the learning efficiency and (3) improving the final learning outcome.
    MELEP: A Novel Predictive Measure of Transferability in Multi-Label ECG Analysis. (arXiv:2311.04224v1 [eess.SP])
    We introduce MELEP, which stands for Muti-label Expected Log of Empirical Predictions, a novel measure to estimate how effective it is to transfer knowledge from a pre-trained model to a downstream task in a multi-label settings. The measure is generic to work with new target data having a different label set from source data. It is also computationally efficient, only requires forward passing the downstream dataset through the pre-trained model once. To the best of our knowledge, we are the first to develop such a transferability metric for multi-label ECG classification problems. Our experiments show that MELEP can predict the performance of pre-trained convolutional and recurrent deep neural networks, on small and imbalanced ECG data. Specifically, strong correlation coefficients, with absolute values exceeding 0.6 in most cases, were observed between MELEP and the actual average F1 scores of the fine-tuned models.
    Convex Methods for Constrained Linear Bandits. (arXiv:2311.04338v1 [cs.LG])
    Recently, bandit optimization has received significant attention in real-world safety-critical systems that involve repeated interactions with humans. While there exist various algorithms with performance guarantees in the literature, practical implementation of the algorithms has not received as much attention. This work presents a comprehensive study on the computational aspects of safe bandit algorithms, specifically safe linear bandits, by introducing a framework that leverages convex programming tools to create computationally efficient policies. In particular, we first characterize the properties of the optimal policy for safe linear bandit problem and then propose an end-to-end pipeline of safe linear bandit algorithms that only involves solving convex problems. We also numerically evaluate the performance of our proposed methods.
    Towards Few-Annotation Learning in Computer Vision: Application to Image Classification and Object Detection tasks. (arXiv:2311.04888v1 [cs.CV])
    In this thesis, we develop theoretical, algorithmic and experimental contributions for Machine Learning with limited labels, and more specifically for the tasks of Image Classification and Object Detection in Computer Vision. In a first contribution, we are interested in bridging the gap between theory and practice for popular Meta-Learning algorithms used in Few-Shot Classification. We make connections to Multi-Task Representation Learning, which benefits from solid theoretical foundations, to verify the best conditions for a more efficient meta-learning. Then, to leverage unlabeled data when training object detectors based on the Transformer architecture, we propose both an unsupervised pretraining and a semi-supervised learning method in two other separate contributions. For pretraining, we improve Contrastive Learning for object detectors by introducing the localization information. Finally, our semi-supervised method is the first tailored to transformer-based detectors.
    PB-LLM: Partially Binarized Large Language Models. (arXiv:2310.00034v2 [cs.LG] UPDATED)
    This paper explores network binarization, a radical form of quantization, compressing model weights to a single bit, specifically for Large Language Models (LLMs) compression. Due to previous binarization methods collapsing LLMs, we propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs. Specifically, our exploration first uncovers the ineffectiveness of naive applications of existing binarization algorithms and highlights the imperative role of salient weights in achieving low-bit quantization. Thus, PB-LLM filters a small ratio of salient weights during binarization, allocating them to higher-bit storage, i.e., partially-binarization. PB-LLM is extended to recover the capacities of quantized LMMs, by analyzing from the perspective of post-training quantization (PTQ) and quantization-aware training (QAT). Under PTQ, combining the concepts from GPTQ, we reconstruct the binarized weight matrix guided by the Hessian matrix and successfully recover the reasoning capacity of PB-LLM in low-bit. Under QAT, we freeze the salient weights during training, explore the derivation of optimal scaling factors crucial for minimizing the quantization error, and propose a scaling mechanism based on this derived scaling strategy for residual binarized weights. Those explorations and the developed methodologies significantly contribute to rejuvenating the performance of low-bit quantized LLMs and present substantial advancements in the field of network binarization for LLMs.The code is available at https://github.com/hahnyuan/BinaryLLM.
    Analysis and Applications of Deep Learning with Finite Samples in Full Life-Cycle Intelligence of Nuclear Power Generation. (arXiv:2311.04247v1 [cs.LG])
    The advent of Industry 4.0 has precipitated the incorporation of Artificial Intelligence (AI) methods within industrial contexts, aiming to realize intelligent manufacturing, operation as well as maintenance, also known as industrial intelligence. However, intricate industrial milieus, particularly those relating to energy exploration and production, frequently encompass data characterized by long-tailed class distribution, sample imbalance, and domain shift. These attributes pose noteworthy challenges to data-centric Deep Learning (DL) techniques, crucial for the realization of industrial intelligence. The present study centers on the intricate and distinctive industrial scenarios of Nuclear Power Generation (NPG), meticulously scrutinizing the application of DL techniques under the constraints of finite data samples. Initially, the paper expounds on potential employment scenarios for AI across the full life-cycle of NPG. Subsequently, we delve into an evaluative exposition of DL's advancement, grounded in the finite sample perspective. This encompasses aspects such as small-sample learning, few-shot learning, zero-shot learning, and open-set recognition, also referring to the unique data characteristics of NPG. The paper then proceeds to present two specific case studies. The first revolves around the automatic recognition of zirconium alloy metallography, while the second pertains to open-set recognition for signal diagnosis of machinery sensors. These cases, spanning the entirety of NPG's life-cycle, are accompanied by constructive outcomes and insightful deliberations. By exploring and applying DL methodologies within the constraints of finite sample availability, this paper not only furnishes a robust technical foundation but also introduces a fresh perspective toward the secure and efficient advancement and exploitation of this advanced energy source.
    On Characterizing the Evolution of Embedding Space of Neural Networks using Algebraic Topology. (arXiv:2311.04592v1 [cs.LG])
    We study how the topology of feature embedding space changes as it passes through the layers of a well-trained deep neural network (DNN) through Betti numbers. Motivated by existing studies using simplicial complexes on shallow fully connected networks (FCN), we present an extended analysis using Cubical homology instead, with a variety of popular deep architectures and real image datasets. We demonstrate that as depth increases, a topologically complicated dataset is transformed into a simple one, resulting in Betti numbers attaining their lowest possible value. The rate of decay in topological complexity (as a metric) helps quantify the impact of architectural choices on the generalization ability. Interestingly from a representation learning perspective, we highlight several invariances such as topological invariance of (1) an architecture on similar datasets; (2) embedding space of a dataset for architectures of variable depth; (3) embedding space to input resolution/size, and (4) data sub-sampling. In order to further demonstrate the link between expressivity \& the generalization capability of a network, we consider the task of ranking pre-trained models for downstream classification task (transfer learning). Compared to existing approaches, the proposed metric has a better correlation to the actually achievable accuracy via fine-tuning the pre-trained model.
    A Deep Learning Based Resource Allocator for Communication Systems with Dynamic User Utility Demands. (arXiv:2311.04600v1 [eess.SP])
    Deep learning (DL) based resource allocation (RA) has recently gained a lot of attention due to its performance efficiency. However, most of the related studies assume an ideal case where the number of users and their utility demands, e.g., data rate constraints, are fixed and the designed DL based RA scheme exploits a policy trained only for these fixed parameters. A computationally complex policy retraining is required whenever these parameters change. Therefore, in this paper, a DL based resource allocator (ALCOR) is introduced, which allows users to freely adjust their utility demands based on, e.g., their application layer. ALCOR employs deep neural networks (DNNs), as the policy, in an iterative optimization algorithm. The optimization algorithm aims to optimize the on-off status of users in a time-sharing problem to satisfy their utility demands in expectation. The policy performs unconstrained RA (URA) -- RA without taking into account user utility demands -- among active users to maximize the sum utility (SU) at each time instant. Based on the chosen URA scheme, ALCOR can perform RA in a model-based or model-free manner and in a centralized or distributed scenario. Derived convergence analyses provide guarantees for the convergence of ALCOR, and numerical experiments corroborate its effectiveness.
    Assessing Upper Limb Motor Function in the Immediate Post-Stroke Perioud Using Accelerometry. (arXiv:2311.04226v1 [cs.LG])
    Accelerometry has been extensively studied as an objective means of measuring upper limb function in patients post-stroke. The objective of this paper is to determine whether the accelerometry-derived measurements frequently used in more long-term rehabilitation studies can also be used to monitor and rapidly detect sudden changes in upper limb motor function in more recently hospitalized stroke patients. Six binary classification models were created by training on variable data window times of paretic upper limb accelerometer feature data. The models were assessed on their effectiveness for differentiating new input data into two classes: severe or moderately severe motor function. The classification models yielded Area Under the Curve (AUC) scores that ranged from 0.72 to 0.82 for 15-minute data windows to 0.77 to 0.94 for 120-minute data windows. These results served as a preliminary assessment and a basis on which to further investigate the efficacy of using accelerometry and machine learning to alert healthcare professionals to rapid changes in motor function in the days immediately following a stroke.
    Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models. (arXiv:2311.04378v1 [cs.LG])
    Watermarking generative models consists of planting a statistical signal (watermark) in a model's output so that it can be later verified that the output was generated by the given model. A strong watermarking scheme satisfies the property that a computationally bounded attacker cannot erase the watermark without causing significant quality degradation. In this paper, we study the (im)possibility of strong watermarking schemes. We prove that, under well-specified and natural assumptions, strong watermarking is impossible to achieve. This holds even in the private detection algorithm setting, where the watermark insertion and detection algorithms share a secret key, unknown to the attacker. To prove this result, we introduce a generic efficient watermark attack; the attacker is not required to know the private key of the scheme or even which scheme is used. Our attack is based on two assumptions: (1) The attacker has access to a "quality oracle" that can evaluate whether a candidate output is a high-quality response to a prompt, and (2) The attacker has access to a "perturbation oracle" which can modify an output with a nontrivial probability of maintaining quality, and which induces an efficiently mixing random walk on high-quality outputs. We argue that both assumptions can be satisfied in practice by an attacker with weaker computational capabilities than the watermarked model itself, to which the attacker has only black-box access. Furthermore, our assumptions will likely only be easier to satisfy over time as models grow in capabilities and modalities. We demonstrate the feasibility of our attack by instantiating it to attack three existing watermarking schemes for large language models: Kirchenbauer et al. (2023), Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully removes the watermarks planted by all three schemes, with only minor quality degradation.
    Natural Bayesian Cram\'er-Rao Bound with an Application to Covariance Estimation. (arXiv:2311.04748v1 [math.ST])
    In this paper, we propose to develop a new Cram\'er-Rao Bound (CRB) when the parameter to estimate lies in a manifold and follows a prior distribution. This derivation leads to a natural inequality between an error criteria based on geometrical properties and this new bound. This main contribution is illustrated in the problem of covariance estimation when the data follow a Gaussian distribution and the prior distribution is an inverse Wishart. Numerical simulation shows new results where the proposed CRB allows to exhibit interesting properties of the MAP estimator which are not observed with the classical Bayesian CRB.
    How to select an objective function using information theory. (arXiv:2212.06566v3 [cs.LG] UPDATED)
    In machine learning or scientific computing, model performance is measured with an objective function. But why choose one objective over another? Information theory gives one answer: To maximize the information in the model, select the objective function that represents the error in the fewest bits. To evaluate different objectives, transform them into likelihood functions. As likelihoods, their relative magnitude represents how strongly we should prefer one objective versus another, and the log of that relation represents the difference in their bit-length, as well as the difference in their uncertainty. In other words, prefer whichever objective minimizes the uncertainty. Under the information-theoretic paradigm, the ultimate objective is to maximize information (and minimize uncertainty), as opposed to any specific utility. We argue that this paradigm is well-suited to models that have many uses and no definite utility, like the large Earth system models used to understand the effects of climate change.
    Recursion in Recursion: Two-Level Nested Recursion for Length Generalization with Scalability. (arXiv:2311.04449v1 [cs.LG])
    Binary Balanced Tree RvNNs (BBT-RvNNs) enforce sequence composition according to a preset balanced binary tree structure. Thus, their non-linear recursion depth is just $\log_2 n$ ($n$ being the sequence length). Such logarithmic scaling makes BBT-RvNNs efficient and scalable on long sequence tasks such as Long Range Arena (LRA). However, such computational efficiency comes at a cost because BBT-RvNNs cannot solve simple arithmetic tasks like ListOps. On the flip side, RvNNs (e.g., Beam Tree RvNN) that do succeed on ListOps (and other structure-sensitive tasks like formal logical inference) are generally several times more expensive than even RNNs. In this paper, we introduce a novel framework -- Recursion in Recursion (RIR) to strike a balance between the two sides - getting some of the benefits from both worlds. In RIR, we use a form of two-level nested recursion - where the outer recursion is a $k$-ary balanced tree model with another recursive model (inner recursion) implementing its cell function. For the inner recursion, we choose Beam Tree RvNNs (BT-RvNN). To adjust BT-RvNNs within RIR we also propose a novel strategy of beam alignment. Overall, this entails that the total recursive depth in RIR is upper-bounded by $k \log_k n$. Our best RIR-based model is the first model that demonstrates high ($\geq 90\%$) length-generalization performance on ListOps while at the same time being scalable enough to be trainable on long sequence inputs from LRA. Moreover, in terms of accuracy in the LRA language tasks, it performs competitively with Structured State Space Models (SSMs) without any special initialization - outperforming Transformers by a large margin. On the other hand, while SSMs can marginally outperform RIR on LRA, they (SSMs) fail to length-generalize on ListOps. Our code is available at: \url{https://github.com/JRC1995/BeamRecursionFamily/}.
    Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models. (arXiv:2311.04902v1 [cs.CL])
    Large Language Models (LLMs) with a billion or more parameters are prime targets for network pruning, which aims to reduce a portion of the network weights without compromising performance. Prior approaches such as Weights Magnitude, SparseGPT, and Wanda, either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained large language models. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the importance pruning score, and substantially outperforms competitive counterparts like SparseGPT and Wanda in multiple benchmarks. Intriguing, after incorporating gradients, the unstructured pruning method tends to reveal some structural patterns post-pruning, which mirrors the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates to maintain its simplicity as other counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various language benchmarks and perplexity show that GBLM-Pruner surpasses magnitude pruning, Wanda (weights+activations) and SparseGPT (weights+activations+weight update) by significant margins. Our code and models are available at https://github.com/RocktimJyotiDas/GBLM-Pruner.
    Fast, accurate, and interpretable decoding of electrocorticographic signals using dynamic mode decomposition. (arXiv:2311.04225v1 [eess.SP])
    Dynamic mode (DM) decomposition decomposes spatiotemporal signals into basic oscillatory components (DMs). DMs can improve the accuracy of neural decoding when used with the nonlinear Grassmann kernel, compared to conventional power features. However, such kernel-based machine learning algorithms have three limitations: large computational time preventing real-time application, incompatibility with non-kernel algorithms, and low interpretability. Here, we propose a mapping function corresponding to the Grassmann kernel that explicitly transforms DMs into spatial DM (sDM) features, which can be used in any machine learning algorithm. Using electrocorticographic signals recorded during various movement and visual perception tasks, the sDM features were shown to improve the decoding accuracy and computational time compared to conventional methods. Furthermore, the components of the sDM features informative for decoding showed similar characteristics to the high-$\gamma$ power of the signals, but with higher trial-to-trial reproducibility. The proposed sDM features enable fast, accurate, and interpretable neural decoding.
    Likelihood Ratio Confidence Sets for Sequential Decision Making. (arXiv:2311.04402v1 [cs.LG])
    Certifiable, adaptive uncertainty estimates for unknown quantities are an essential ingredient of sequential decision-making algorithms. Standard approaches rely on problem-dependent concentration results and are limited to a specific combination of parameterization, noise family, and estimator. In this paper, we revisit the likelihood-based inference principle and propose to use likelihood ratios to construct any-time valid confidence sequences without requiring specialized treatment in each application scenario. Our method is especially suitable for problems with well-specified likelihoods, and the resulting sets always maintain the prescribed coverage in a model-agnostic manner. The size of the sets depends on a choice of estimator sequence in the likelihood ratio. We discuss how to provably choose the best sequence of estimators and shed light on connections to online convex optimization with algorithms such as Follow-the-Regularized-Leader. To counteract the initially large bias of the estimators, we propose a reweighting scheme that also opens up deployment in non-parametric settings such as RKHS function classes. We provide a non-asymptotic analysis of the likelihood ratio confidence sets size for generalized linear models, using insights from convex duality and online learning. We showcase the practical strength of our method on generalized linear bandit problems, survival analysis, and bandits with various additive noise distributions.
    What Can We Learn from Unlearnable Datasets?. (arXiv:2305.19254v3 [cs.LG] UPDATED)
    In an era of widespread web scraping, unlearnable dataset methods have the potential to protect data privacy by preventing deep neural networks from generalizing. But in addition to a number of practical limitations that make their use unlikely, we make a number of findings that call into question their ability to safeguard data. First, it is widely believed that neural networks trained on unlearnable datasets only learn shortcuts, simpler rules that are not useful for generalization. In contrast, we find that networks actually can learn useful features that can be reweighed for high test performance, suggesting that image protection is not assured. Unlearnable datasets are also believed to induce learning shortcuts through linear separability of added perturbations. We provide a counterexample, demonstrating that linear separability of perturbations is not a necessary condition. To emphasize why linearly separable perturbations should not be relied upon, we propose an orthogonal projection attack which allows learning from unlearnable datasets published in ICML 2021 and ICLR 2023. Our proposed attack is significantly less complex than recently proposed techniques.
    Exploring Best Practices for ECG Signal Processing in Machine Learning. (arXiv:2311.04229v1 [eess.SP])
    In this work we search for best practices in pre-processing of Electrocardiogram (ECG) signals in order to train better classifiers for the diagnosis of heart conditions. State of the art machine learning algorithms have achieved remarkable results in classification of some heart conditions using ECG data, yet there appears to be no consensus on pre-processing best practices. Is this lack of consensus due to different conditions and architectures requiring different processing steps for optimal performance? Is it possible that state of the art deep-learning models have rendered pre-processing unnecessary? In this work we apply down-sampling, normalization, and filtering functions to 3 different multi-label ECG datasets and measure their effects on 3 different high-performing time-series classifiers. We find that sampling rates as low as 50Hz can yield comparable results to the commonly used 500Hz. This is significant as smaller sampling rates will result in smaller datasets and models, which require less time and resources to train. Additionally, despite their common usage, we found min-max normalization to be slightly detrimental overall, and band-passing to make no measurable difference. We found the blind approach to pre-processing of ECGs for multi-label classification to be ineffective, with the exception of sample rate reduction which reliably reduces computational resources, but does not increase accuracy.
    The PetShop Dataset -- Finding Causes of Performance Issues across Microservices. (arXiv:2311.04806v1 [cs.DC])
    Identifying root causes for unexpected or undesirable behavior in complex systems is a prevalent challenge. This issue becomes especially crucial in modern cloud applications that employ numerous microservices. Although the machine learning and systems research communities have proposed various techniques to tackle this problem, there is currently a lack of standardized datasets for quantitative benchmarking. Consequently, research groups are compelled to create their own datasets for experimentation. This paper introduces a dataset specifically designed for evaluating root cause analyses in microservice-based applications. The dataset encompasses latency, requests, and availability metrics emitted in 5-minute intervals from a distributed application. In addition to normal operation metrics, the dataset includes 68 injected performance issues, which increase latency and reduce availability throughout the system. We showcase how this dataset can be used to evaluate the accuracy of a variety of methods spanning different causal and non-causal characterisations of the root cause analysis problem. We hope the new dataset, available at https://github.com/amazon-science/petshop-root-cause-analysis/ enables further development of techniques in this important area.
    A Lightweight Architecture for Real-Time Neuronal-Spike Classification. (arXiv:2311.04808v1 [cs.AR])
    Electrophysiological recordings of neural activity in a mouse's brain are very popular among neuroscientists for understanding brain function. One particular area of interest is acquiring recordings from the Purkinje cells in the cerebellum in order to understand brain injuries and the loss of motor functions. However, current setups for such experiments do not allow the mouse to move freely and, thus, do not capture its natural behaviour since they have a wired connection between the animal's head stage and an acquisition device. In this work, we propose a lightweight neuronal-spike detection and classification architecture that leverages on the unique characteristics of the Purkinje cells to discard unneeded information from the sparse neural data in real time. This allows the (condensed) data to be easily stored on a removable storage device on the head stage, alleviating the need for wires. Our proposed implementation shows a >95% overall classification accuracy while still resulting in a small-form-factor design, which allows for the free movement of mice during experiments. Moreover, the power-efficient nature of the design and the usage of STT-RAM (Spin Transfer Torque Magnetic Random Access Memory) as the removable storage allows the head stage to easily operate on a tiny battery for up to approximately 4 days.
    Speech language models lack important brain-relevant semantics. (arXiv:2311.04664v1 [cs.CL])
    Despite known differences between reading and listening in the brain, recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree. This poses the question of what types of information language models truly predict in the brain. We investigate this question via a direct approach, in which we eliminate information related to specific low-level stimulus features (textual, speech, and visual) in the language model representations, and observe how this intervention affects the alignment with fMRI brain recordings acquired while participants read versus listened to the same naturalistic stories. We further contrast our findings with speech-based language models, which would be expected to predict speech-evoked brain activity better, provided they model language processing in the brain well. Using our direct approach, we find that both text-based and speech-based language models align well with early sensory regions due to shared low-level features. Text-based models continue to align well with later language regions even after removing these features, while, surprisingly, speech-based models lose most of their alignment. These findings suggest that speech-based models can be further improved to better reflect brain-like language processing.
    AI-Enabled Unmanned Vehicle-Assisted Reconfigurable Intelligent Surfaces: Deployment, Prototyping, Experiments, and Opportunities. (arXiv:2311.04241v1 [eess.SP])
    The requirement of wireless data demands is increasingly high as the sixth-generation (6G) technology evolves. Reconfigurable intelligent surface (RIS) is promisingly deemed to be one of 6G techniques for extending service coverage, reducing power consumption, and enhancing spectral efficiency. In this article, we have provided some fundamentals of RIS deployment in theory and hardware perspectives as well as utilization of artificial intelligence (AI) and machine learning. We conducted an intelligent deployment of RIS (i-Dris) prototype, including dual-band auto-guided vehicle (AGV) assisted RISs associated with an mmWave base station (BS) and a receiver. The RISs are deployed on the AGV with configured incident/reflection angles. While, both the mmWave BS and receiver are associated with an edge server monitoring downlink packets for obtaining system throughput. We have designed a federated multi-agent reinforcement learning scheme associated with several AGV-RIS agents and sub-agents per AGV-RIS consisting of the deployment of position, height, orientation and elevation angles. The experimental results presented the stationary measurement in different aspects and scenarios. The i-Dris can reach up to 980 Mbps transmission throughput under a bandwidth of 100 MHz with comparably low complexity as well as rapid deployment, which outperforms the other existing works. At last, we highlight some opportunities and future issues in leveraging RIS-empowered wireless communication networks.
    Determination of toxic comments and unintended model bias minimization using Deep learning approach. (arXiv:2311.04789v1 [cs.LG])
    Online conversations can be toxic and subjected to threats, abuse, or harassment. To identify toxic text comments, several deep learning and machine learning models have been proposed throughout the years. However, recent studies demonstrate that because of the imbalances in the training data, some models are more likely to show unintended biases including gender bias and identity bias. In this research, our aim is to detect toxic comment and reduce the unintended bias concerning identity features such as race, gender, sex, religion by fine-tuning an attention based model called BERT(Bidirectional Encoder Representation from Transformers). We apply weighted loss to address the issue of unbalanced data and compare the performance of a fine-tuned BERT model with a traditional Logistic Regression model in terms of classification and bias minimization. The Logistic Regression model with the TFIDF vectorizer achieve 57.1% accuracy, and fine-tuned BERT model's accuracy is 89%. Code is available at https://github.com/zim10/Determine_Toxic_comment_and_identity_bias.git
    Causal Scoring: A Framework for Effect Estimation, Effect Ordering, and Effect Classification. (arXiv:2206.12532v3 [stat.ML] UPDATED)
    This paper introduces causal scoring as a novel approach to frame causal estimation in the context of decision making. Causal scoring entails the estimation of scores that support decision making by providing insights into causal effects. We present three valuable causal interpretations of these scores: effect estimation (EE), effect ordering (EO), and effect classification (EC). In the EE interpretation, the causal score represents the effect itself. The EO interpretation implies that the score can serve as a proxy for the magnitude of the effect, enabling the sorting of individuals based on their causal effects. The EC interpretation enables the classification of individuals into high- and low-effect categories using a predefined threshold. We demonstrate the value of these alternative causal interpretations (EO and EC) through two key results. First, we show that aligning the statistical modeling with the desired causal interpretation improves the accuracy of causal estimation. Second, we establish that more flexible causal interpretations are plausible in a wider range of data-generating processes and propose conditions to assess their validity. We showcase the practical utility of the causal scoring framework through examples in diverse fields such as advertising, healthcare, and education, illustrating how it facilitates reasoning about flexible causal interpretations of statistical estimates in various contexts. The examples encompass confounded estimates, effect estimates on surrogate outcomes, and even predictions about non-causal quantities as potential causal scores.
    Sharp Spectral Rates for Koopman Operator Learning. (arXiv:2302.02004v4 [cs.LG] UPDATED)
    Nonlinear dynamical systems can be handily described by the associated Koopman operator, whose action evolves every observable of the system forward in time. Learning the Koopman operator and its spectral decomposition from data is enabled by a number of algorithms. In this work we present for the first time non-asymptotic learning bounds for the Koopman eigenvalues and eigenfunctions. We focus on time-reversal-invariant stochastic dynamical systems, including the important example of Langevin dynamics. We analyze two popular estimators: Extended Dynamic Mode Decomposition (EDMD) and Reduced Rank Regression (RRR). Our results critically hinge on novel {minimax} estimation bounds for the operator norm error, that may be of independent interest. Our spectral learning bounds are driven by the simultaneous control of the operator norm error and a novel metric distortion functional of the estimated eigenfunctions. The bounds indicates that both EDMD and RRR have similar variance, but EDMD suffers from a larger bias which might be detrimental to its learning rate. Our results shed new light on the emergence of spurious eigenvalues, an issue which is well known empirically. Numerical experiments illustrate the implications of the bounds in practice.
    RankAug: Augmented data ranking for text classification. (arXiv:2311.04535v1 [cs.CL])
    Research on data generation and augmentation has been focused majorly on enhancing generation models, leaving a notable gap in the exploration and refinement of methods for evaluating synthetic data. There are several text similarity metrics within the context of generated data filtering which can impact the performance of specific Natural Language Understanding (NLU) tasks, specifically focusing on intent and sentiment classification. In this study, we propose RankAug, a text-ranking approach that detects and filters out the top augmented texts in terms of being most similar in meaning with lexical and syntactical diversity. Through experiments conducted on multiple datasets, we demonstrate that the judicious selection of filtering techniques can yield a substantial improvement of up to 35% in classification accuracy for under-represented classes.
    Device Sampling and Resource Optimization for Federated Learning in Cooperative Edge Networks. (arXiv:2311.04350v1 [cs.NI])
    The conventional federated learning (FedL) architecture distributes machine learning (ML) across worker devices by having them train local models that are periodically aggregated by a server. FedL ignores two important characteristics of contemporary wireless networks, however: (i) the network may contain heterogeneous communication/computation resources, and (ii) there may be significant overlaps in devices' local data distributions. In this work, we develop a novel optimization methodology that jointly accounts for these factors via intelligent device sampling complemented by device-to-device (D2D) offloading. Our optimization methodology aims to select the best combination of sampled nodes and data offloading configuration to maximize FedL training accuracy while minimizing data processing and D2D communication resource consumption subject to realistic constraints on the network topology and device capabilities. Theoretical analysis of the D2D offloading subproblem leads to new FedL convergence bounds and an efficient sequential convex optimizer. Using these results, we develop a sampling methodology based on graph convolutional networks (GCNs) which learns the relationship between network attributes, sampled nodes, and D2D data offloading to maximize FedL accuracy. Through evaluation on popular datasets and real-world network measurements from our edge testbed, we find that our methodology outperforms popular device sampling methodologies from literature in terms of ML model performance, data processing overhead, and energy consumption.
    Lidar Annotation Is All You Need. (arXiv:2311.04777v1 [cs.CV])
    In recent years, computer vision has transformed fields such as medical imaging, object recognition, and geospatial analytics. One of the fundamental tasks in computer vision is semantic image segmentation, which is vital for precise object delineation. Autonomous driving represents one of the key areas where computer vision algorithms are applied. The task of road surface segmentation is crucial in self-driving systems, but it requires a labor-intensive annotation process in several data domains. The work described in this paper aims to improve the efficiency of image segmentation using a convolutional neural network in a multi-sensor setup. This approach leverages lidar (Light Detection and Ranging) annotations to directly train image segmentation models on RGB images. Lidar supplements the images by emitting laser pulses and measuring reflections to provide depth information. However, lidar's sparse point clouds often create difficulties for accurate object segmentation. Segmentation of point clouds requires time-consuming preliminary data preparation and a large amount of computational resources. The key innovation of our approach is the masked loss, addressing sparse ground-truth masks from point clouds. By calculating loss exclusively where lidar points exist, the model learns road segmentation on images by using lidar points as ground truth. This approach allows for blending of different ground-truth data types during model training. Experimental validation of the approach on benchmark datasets shows comparable performance to a high-quality image segmentation model. Incorporating lidar reduces the load on annotations and enables training of image-segmentation models without loss of segmentation quality. The methodology is tested on diverse datasets, both publicly available and proprietary. The strengths and weaknesses of the proposed method are also discussed in the paper.
    Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection. (arXiv:2311.04588v1 [cs.LG])
    Machine Learning (ML) models become vulnerable to Model Stealing Attacks (MSA) when they are deployed as a service. In such attacks, the deployed model is queried repeatedly to build a labelled dataset. This dataset allows the attacker to train a thief model that mimics the original model. To maximize query efficiency, the attacker has to select the most informative subset of data points from the pool of available data. Existing attack strategies utilize approaches like Active Learning and Semi-Supervised learning to minimize costs. However, in the black-box setting, these approaches may select sub-optimal samples as they train only one thief model. Depending on the thief model's capacity and the data it was pretrained on, the model might even select noisy samples that harm the learning process. In this work, we explore the usage of an ensemble of deep learning models as our thief model. We call our attack Army of Thieves(AOT) as we train multiple models with varying complexities to leverage the crowd's wisdom. Based on the ensemble's collective decision, uncertain samples are selected for querying, while the most confident samples are directly included in the training data. Our approach is the first one to utilize an ensemble of thief models to perform model extraction. We outperform the base approaches of existing state-of-the-art methods by at least 3% and achieve a 21% higher adversarial sample transferability than previous work for models trained on the CIFAR-10 dataset.
    Information-Theoretic Generalization Bounds for Transductive Learning and its Applications. (arXiv:2311.04561v1 [cs.LG])
    In this paper, we develop data-dependent and algorithm-dependent generalization bounds for transductive learning algorithms in the context of information theory for the first time. We show that the generalization gap of transductive learning algorithms can be bounded by the mutual information between training labels and hypothesis. By innovatively proposing the concept of transductive supersamples, we go beyond the inductive learning setting and establish upper bounds in terms of various information measures. Furthermore, we derive novel PAC-Bayesian bounds and build the connection between generalization and loss landscape flatness under the transductive learning setting. Finally, we present the upper bounds for adaptive optimization algorithms and demonstrate the applications of results on semi-supervised learning and graph learning scenarios. Our theoretic results are validated on both synthetic and real-world datasets.
    Predicting Properties of Nodes via Community-Aware Features. (arXiv:2311.04730v1 [cs.SI])
    A community structure that is often present in complex networks plays an important role not only in their formation but also shapes dynamics of these networks, affecting properties of their nodes. In this paper, we propose a family of community-aware node features and then investigate their properties. We show that they have high predictive power for classification tasks. We also verify that they contain information that cannot be recovered neither by classical node features nor by node embeddings (both classical as well as structural).
    IoT-Based Environmental Control System for Fish Farms with Sensor Integration and Machine Learning Decision Support. (arXiv:2311.04258v1 [eess.SP])
    In response to the burgeoning global demand for seafood and the challenges of managing fish farms, we introduce an innovative IoT based environmental control system that integrates sensor technology and advanced machine learning decision support. Deploying a network of wireless sensors within the fish farm, we continuously collect real-time data on crucial environmental parameters, including water temperature, pH levels, humidity, and fish behavior. This data undergoes meticulous preprocessing to ensure its reliability, including imputation, outlier detection, feature engineering, and synchronization. At the heart of our system are four distinct machine learning algorithms: Random Forests predict and optimize water temperature and pH levels for the fish, fostering their health and growth; Support Vector Machines (SVMs) function as an early warning system, promptly detecting diseases and parasites in fish; Gradient Boosting Machines (GBMs) dynamically fine-tune the feeding schedule based on real-time environmental conditions, promoting resource efficiency and fish productivity; Neural Networks manage the operation of critical equipment like water pumps and heaters to maintain the desired environmental conditions within the farm. These machine learning algorithms collaboratively make real-time decisions to ensure that the fish farm's environmental conditions align with predefined specifications, leading to improved fish health and productivity while simultaneously reducing resource wastage, thereby contributing to increased profitability and sustainability. This research article showcases the power of data-driven decision support in fish farming, promising to meet the growing demand for seafood while emphasizing environmental responsibility and economic viability, thus revolutionizing the future of fish farming.
    LRM: Large Reconstruction Model for Single Image to 3D. (arXiv:2311.04400v1 [cs.CV])
    We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods that are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data empowers our model to be highly generalizable and produce high-quality 3D reconstructions from various testing inputs including real-world in-the-wild captures and images from generative models. Video demos and interactable 3D meshes can be found on this website: https://yiconghong.me/LRM/.
    Strategies for Parallelizing the Big-Means Algorithm: A Comprehensive Tutorial for Effective Big Data Clustering. (arXiv:2311.04517v1 [cs.DC])
    This study focuses on the optimization of the Big-means algorithm for clustering large-scale datasets, exploring four distinct parallelization strategies. We conducted extensive experiments to assess the computational efficiency, scalability, and clustering performance of each approach, revealing their benefits and limitations. The paper also delves into the trade-offs between computational efficiency and clustering quality, examining the impacts of various factors. Our insights provide practical guidance on selecting the best parallelization strategy based on available resources and dataset characteristics, contributing to a deeper understanding of parallelization techniques for the Big-means algorithm.
    Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates. (arXiv:2309.08125v2 [cs.DC] UPDATED)
    Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least $f+1$ logically equivalent pipeline replicas to tolerate any $f$ simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after $f$ or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to $29.6x$.
    Benchmark Generation Framework with Customizable Distortions for Image Classifier Robustness. (arXiv:2310.18626v2 [cs.CV] UPDATED)
    We present a novel framework for generating adversarial benchmarks to evaluate the robustness of image classification models. Our framework allows users to customize the types of distortions to be optimally applied to images, which helps address the specific distortions relevant to their deployment. The benchmark can generate datasets at various distortion levels to assess the robustness of different image classifiers. Our results show that the adversarial samples generated by our framework with any of the image classification models, like ResNet-50, Inception-V3, and VGG-16, are effective and transferable to other models causing them to fail. These failures happen even when these models are adversarially retrained using state-of-the-art techniques, demonstrating the generalizability of our adversarial samples. We achieve competitive performance in terms of net $L_2$ distortion compared to state-of-the-art benchmark techniques on CIFAR-10 and ImageNet; however, we demonstrate our framework achieves such results with simple distortions like Gaussian noise without introducing unnatural artifacts or color bleeds. This is made possible by a model-based reinforcement learning (RL) agent and a technique that reduces a deep tree search of the image for model sensitivity to perturbations, to a one-level analysis and action. The flexibility of choosing distortions and setting classification probability thresholds for multiple classes makes our framework suitable for algorithmic audits.
    Lie Point Symmetry and Physics Informed Networks. (arXiv:2311.04293v1 [cs.LG])
    Symmetries have been leveraged to improve the generalization of neural networks through different mechanisms from data augmentation to equivariant architectures. However, despite their potential, their integration into neural solvers for partial differential equations (PDEs) remains largely unexplored. We explore the integration of PDE symmetries, known as Lie point symmetries, in a major family of neural solvers known as physics-informed neural networks (PINNs). We propose a loss function that informs the network about Lie point symmetries in the same way that PINN models try to enforce the underlying PDE through a loss function. Intuitively, our symmetry loss ensures that the infinitesimal generators of the Lie group conserve the PDE solutions. Effectively, this means that once the network learns a solution, it also learns the neighbouring solutions generated by Lie point symmetries. Empirical evaluations indicate that the inductive bias introduced by the Lie point symmetries of the PDEs greatly boosts the sample efficiency of PINNs.
    PDFTriage: Question Answering over Long, Structured Documents. (arXiv:2309.08872v2 [cs.CL] UPDATED)
    Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document is unable to fit in the small context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document, representing them as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of these documents with rich structure. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either structure or content. Our experiments demonstrate the effectiveness of the proposed PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents from 10 different categories of question types for document QA. Our code and datasets will be released soon on Github.
    Multitask Kernel-based Learning with First-Order Logic Constraints. (arXiv:2311.03340v2 [cs.LG] UPDATED)
    In this paper we propose a general framework to integrate supervised and unsupervised examples with background knowledge expressed by a collection of first-order logic clauses into kernel machines. In particular, we consider a multi-task learning scheme where multiple predicates defined on a set of objects are to be jointly learned from examples, enforcing a set of FOL constraints on the admissible configurations of their values. The predicates are defined on the feature spaces, in which the input objects are represented, and can be either known a priori or approximated by an appropriate kernel-based learner. A general approach is presented to convert the FOL clauses into a continuous implementation that can deal with the outputs computed by the kernel-based predicates. The learning problem is formulated as a semi-supervised task that requires the optimization in the primal of a loss function that combines a fitting loss measure on the supervised examples, a regularization term, and a penalty term that enforces the constraints on both the supervised and unsupervised examples. Unfortunately, the penalty term is not convex and it can hinder the optimization process. However, it is possible to avoid poor solutions by using a two stage learning schema, in which the supervised examples are learned first and then the constraints are enforced.
    Robust Best-arm Identification in Linear Bandits. (arXiv:2311.04731v1 [cs.LG])
    We study the robust best-arm identification problem (RBAI) in the case of linear rewards. The primary objective is to identify a near-optimal robust arm, which involves selecting arms at every round and assessing their robustness by exploring potential adversarial actions. This approach is particularly relevant when utilizing a simulator and seeking to identify a robust solution for real-world transfer. To this end, we present an instance-dependent lower bound for the robust best-arm identification problem with linear rewards. Furthermore, we propose both static and adaptive bandit algorithms that achieve sample complexity that matches the lower bound. In synthetic experiments, our algorithms effectively identify the best robust arm and perform similarly to the oracle strategy. As an application, we examine diabetes care and the process of learning insulin dose recommendations that are robust with respect to inaccuracies in standard calculators. Our algorithms prove to be effective in identifying robust dosage values across various age ranges of patients.
    Uncertainty in GNN Learning Evaluations: The Importance of a Consistent Benchmark for Community Detection. (arXiv:2305.06026v4 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have improved unsupervised community detection of clustered nodes due to their ability to encode the dual dimensionality of the connectivity and feature information spaces of graphs. Identifying the latent communities has many practical applications from social networks to genomics. Current benchmarks of real world performance are confusing due to the variety of decisions influencing the evaluation of GNNs at this task. To address this, we propose a framework to establish a common evaluation protocol. We motivate and justify it by demonstrating the differences with and without the protocol. The W Randomness Coefficient is a metric proposed for assessing the consistency of algorithm rankings to quantify the reliability of results under the presence of randomness. We find that by ensuring the same evaluation criteria is followed, there may be significant differences from the reported performance of methods at this task, but a more complete evaluation and comparison of methods is possible.
    Foundational propositions of hesitant fuzzy sets and parameter reductions of hesitant fuzzy information systems. (arXiv:2311.04256v1 [cs.AI])
    Hesitant fuzzy sets are widely used in the instances of uncertainty and hesitation. The inclusion relationship is an important and foundational definition for sets. Hesitant fuzzy set, as a kind of set, needs explicit definition of inclusion relationship. Base on the hesitant fuzzy membership degree of discrete form, several kinds of inclusion relationships for hesitant fuzzy sets are proposed. And then some foundational propositions of hesitant fuzzy sets and the families of hesitant fuzzy sets are presented. Finally, some foundational propositions of hesitant fuzzy information systems with respect to parameter reductions are put forward, and an example and an algorithm are given to illustrate the processes of parameter reductions.
    Hybrid Focal and Full-Range Attention Based Graph Transformers. (arXiv:2311.04653v1 [cs.LG])
    The paradigm of Transformers using the self-attention mechanism has manifested its advantage in learning graph-structured data. Yet, Graph Transformers are capable of modeling full range dependencies but are often deficient in extracting information from locality. A common practice is to utilize Message Passing Neural Networks (MPNNs) as an auxiliary to capture local information, which however are still inadequate for comprehending substructures. In this paper, we present a purely attention-based architecture, namely Focal and Full-Range Graph Transformer (FFGT), which can mitigate the loss of local information in learning global correlations. The core component of FFGT is a new mechanism of compound attention, which combines the conventional full-range attention with K-hop focal attention on ego-nets to aggregate both global and local information. Beyond the scope of canonical Transformers, the FFGT has the merit of being more substructure-aware. Our approach enhances the performance of existing Graph Transformers on various open datasets, while achieves compatible SOTA performance on several Long-Range Graph Benchmark (LRGB) datasets even with a vanilla transformer. We further examine influential factors on the optimal focal length of attention via introducing a novel synthetic dataset based on SBM-PATTERN.
    An Unsupervised Deep Learning Approach for the Wave Equation Inverse Problem. (arXiv:2311.04531v1 [math.NA])
    Full-waveform inversion (FWI) is a powerful geophysical imaging technique that infers high-resolution subsurface physical parameters by solving a non-convex optimization problem. However, due to limitations in observation, e.g., limited shots or receivers, and random noise, conventional inversion methods are confronted with numerous challenges, such as the local-minimum problem. In recent years, a substantial body of work has demonstrated that the integration of deep neural networks and partial differential equations for solving full-waveform inversion problems has shown promising performance. In this work, drawing inspiration from the expressive capacity of neural networks, we provide an unsupervised learning approach aimed at accurately reconstructing subsurface physical velocity parameters. This method is founded on a re-parametrization technique for Bayesian inference, achieved through a deep neural network with random weights. Notably, our proposed approach does not hinge upon the requirement of the labeled training dataset, rendering it exceedingly versatile and adaptable to diverse subsurface models. Extensive experiments show that the proposed approach performs noticeably better than existing conventional inversion methods.
    Learning Performance-Improving Code Edits. (arXiv:2302.07867v4 [cs.SE] UPDATED)
    With the waning of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the semantics of code. Simultaneously, pretrained large language models (LLMs) have demonstrated strong capabilities at solving a wide range of programming tasks. To that end, we introduce a framework for adapting LLMs to high-level program optimization. First, we curate a dataset of performance-improving edits made by human programmers of over 77K competitive C++ programming submission pairs, accompanied by extensive unit tests. A major challenge is the significant variability of measuring performance on commodity hardware, which can lead to spurious "improvements". To isolate and reliably evaluate the impact of program optimizations, we design an environment based on the gem5 full system simulator, the de facto simulator used in academia and industry. Next, we propose a broad range of adaptation strategies for code optimization; for prompting, these include retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play. A combination of these techniques achieves an average speedup of 5.65X on CodeLlama-13B and 6.86X on GPT-3.5, surpassing the best human performance (4.06X). We find our proposed performance-conditioned generation is particularly effective at improving performance as well as increasing the fraction of optimized programs.
    Extending Machine Learning-Based Early Sepsis Detection to Different Demographics. (arXiv:2311.04325v1 [cs.LG])
    Sepsis requires urgent diagnosis, but research is predominantly focused on Western datasets. In this study, we perform a comparative analysis of two ensemble learning methods, LightGBM and XGBoost, using the public eICU-CRD dataset and a private South Korean St. Mary's Hospital's dataset. Our analysis reveals the effectiveness of these methods in addressing healthcare data imbalance and enhancing sepsis detection. Specifically, LightGBM shows a slight edge in computational efficiency and scalability. The study paves the way for the broader application of machine learning in critical care, thereby expanding the reach of predictive analytics in healthcare globally.
    MarioGPT: Open-Ended Text2Level Generation through Large Language Models. (arXiv:2302.05981v3 [cs.AI] UPDATED)
    Procedural Content Generation (PCG) is a technique to generate complex and diverse environments in an automated way. However, while generating content with PCG methods is often straightforward, generating meaningful content that reflects specific intentions and constraints remains challenging. Furthermore, many PCG algorithms lack the ability to generate content in an open-ended manner. Recently, Large Language Models (LLMs) have shown to be incredibly effective in many diverse domains. These trained LLMs can be fine-tuned, re-using information and accelerating training for new tasks. Here, we introduce MarioGPT, a fine-tuned GPT2 model trained to generate tile-based game levels, in our case Super Mario Bros levels. MarioGPT can not only generate diverse levels, but can be text-prompted for controllable level generation, addressing one of the key challenges of current PCG techniques. As far as we know, MarioGPT is the first text-to-level model and combined with novelty search it enables the generation of diverse levels with varying play-style dynamics (i.e. player paths) and the open-ended discovery of an increasingly diverse range of content. Code available at https://github.com/shyamsn97/mario-gpt.
    Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction. (arXiv:2311.02898v2 [eess.AS] UPDATED)
    We introduce a text-to-speech(TTS) framework based on a neural transducer. We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework enjoying its monotonic alignment constraints. The proposed model first generates aligned semantic tokens using the neural transducer, then synthesizes a speech sample from the semantic tokens using a non-autoregressive(NAR) speech generator. This decoupled framework alleviates the training complexity of TTS and allows each stage to focus on 1) linguistic and alignment modeling and 2) fine-grained acoustic modeling, respectively. Experimental results on the zero-shot adaptive TTS show that the proposed model exceeds the baselines in speech quality and speaker similarity via objective and subjective measures. We also investigate the inference speed and prosody controllability of our proposed model, showing the potential of the neural transducer for TTS frameworks.
    Optimal Deep Neural Network Approximation for Korobov Functions with respect to Sobolev Norms. (arXiv:2311.04779v1 [math.NA])
    This paper establishes the nearly optimal rate of approximation for deep neural networks (DNNs) when applied to Korobov functions, effectively overcoming the curse of dimensionality. The approximation results presented in this paper are measured with respect to $L_p$ norms and $H^1$ norms. Our achieved approximation rate demonstrates a remarkable "super-convergence" rate, outperforming traditional methods and any continuous function approximator. These results are non-asymptotic, providing error bounds that consider both the width and depth of the networks simultaneously.
    Exploring Predicate Visual Context in Detecting Human-Object Interactions. (arXiv:2308.06202v2 [cs.CV] UPDATED)
    Recently, the DETR framework has emerged as the dominant approach for human--object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.
    Learning Linear Gaussian Polytree Models with Interventions. (arXiv:2311.04636v1 [stat.ML])
    We present a consistent and highly scalable local approach to learn the causal structure of a linear Gaussian polytree using data from interventional experiments with known intervention targets. Our methods first learn the skeleton of the polytree and then orient its edges. The output is a CPDAG representing the interventional equivalence class of the polytree of the true underlying distribution. The skeleton and orientation recovery procedures we use rely on second order statistics and low-dimensional marginal distributions. We assess the performance of our methods under different scenarios in synthetic data sets and apply our algorithm to learn a polytree in a gene expression interventional data set. Our simulation studies demonstrate that our approach is fast, has good accuracy in terms of structural Hamming distance, and handles problems with thousands of nodes.
    From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion. (arXiv:2308.02560v2 [cs.SD] UPDATED)
    Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generate audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or generating relatively low sampling rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and, evaluation code, along with audio samples, are available on the facebookresearch/audiocraft Github page.
    Towards Open-world Cross-Domain Sequential Recommendation: A Model-Agnostic Contrastive Denoising Approach. (arXiv:2311.04760v1 [cs.IR])
    Cross-domain sequential recommendation (CDSR) aims to address the data sparsity problems that exist in traditional sequential recommendation (SR) systems. The existing approaches aim to design a specific cross-domain unit that can transfer and propagate information across multiple domains by relying on overlapping users with abundant behaviors. However, in real-world recommender systems, CDSR scenarios usually consist of a majority of long-tailed users with sparse behaviors and cold-start users who only exist in one domain. This leads to a drop in the performance of existing CDSR methods in the real-world industry platform. Therefore, improving the consistency and effectiveness of models in open-world CDSR scenarios is crucial for constructing CDSR models (\textit{1st} CH). Recently, some SR approaches have utilized auxiliary behaviors to complement the information for long-tailed users. However, these multi-behavior SR methods cannot deliver promising performance in CDSR, as they overlook the semantic gap between target and auxiliary behaviors, as well as user interest deviation across domains (\textit{2nd} CH).
    Object-Centric Learning with Slot Mixture Module. (arXiv:2311.04640v1 [cs.LG])
    Object-centric architectures usually apply a differentiable module to the entire feature map to decompose it into sets of entity representations called slots. Some of these methods structurally resemble clustering algorithms, where the cluster's center in latent space serves as a slot representation. Slot Attention is an example of such a method, acting as a learnable analog of the soft k-means algorithm. Our work employs a learnable clustering method based on the Gaussian Mixture Model. Unlike other approaches, we represent slots not only as centers of clusters but also incorporate information about the distance between clusters and assigned vectors, leading to more expressive slot representations. Our experiments demonstrate that using this approach instead of Slot Attention improves performance in object-centric scenarios, achieving state-of-the-art results in the set property prediction task.
    Solving High Frequency and Multi-Scale PDEs with Gaussian Processes. (arXiv:2311.04465v1 [cs.LG])
    Machine learning based solvers have garnered much attention in physical simulation and scientific computing, with a prominent example, physics-informed neural networks (PINNs). However, PINNs often struggle to solve high-frequency and multi-scale PDEs, which can be due to spectral bias during neural network training. To address this problem, we resort to the Gaussian process (GP) framework. To flexibly capture the dominant frequencies, we model the power spectrum of the PDE solution with a student t mixture or Gaussian mixture. We then apply the inverse Fourier transform to obtain the covariance function (according to the Wiener-Khinchin theorem). The covariance derived from the Gaussian mixture spectrum corresponds to the known spectral mixture kernel. We are the first to discover its rationale and effectiveness for PDE solving. Next,we estimate the mixture weights in the log domain, which we show is equivalent to placing a Jeffreys prior. It automatically induces sparsity, prunes excessive frequencies, and adjusts the remaining toward the ground truth. Third, to enable efficient and scalable computation on massive collocation points, which are critical to capture high frequencies, we place the collocation points on a grid, and multiply our covariance function at each input dimension. We use the GP conditional mean to predict the solution and its derivatives so as to fit the boundary condition and the equation itself. As a result, we can derive a Kronecker product structure in the covariance matrix. We use Kronecker product properties and multilinear algebra to greatly promote computational efficiency and scalability, without any low-rank approximations. We show the advantage of our method in systematic experiments.
    Towards a Unified Framework of Contrastive Learning for Disentangled Representations. (arXiv:2311.04774v1 [cs.LG])
    Contrastive learning has recently emerged as a promising approach for learning data representations that discover and disentangle the explanatory factors of the data. Previous analyses of such approaches have largely focused on individual contrastive losses, such as noise-contrastive estimation (NCE) and InfoNCE, and rely on specific assumptions about the data generating process. This paper extends the theoretical guarantees for disentanglement to a broader family of contrastive methods, while also relaxing the assumptions about the data distribution. Specifically, we prove identifiability of the true latents for four contrastive losses studied in this paper, without imposing common independence assumptions. The theoretical findings are validated on several benchmark datasets. Finally, practical limitations of these methods are also investigated.
    Identifying Semantic Component for Robust Molecular Property Prediction. (arXiv:2311.04837v1 [cs.LG])
    Although graph neural networks have achieved great success in the task of molecular property prediction in recent years, their generalization ability under out-of-distribution (OOD) settings is still under-explored. Different from existing methods that learn discriminative representations for prediction, we propose a generative model with semantic-components identifiability, named SCI. We demonstrate that the latent variables in this generative model can be explicitly identified into semantic-relevant (SR) and semantic-irrelevant (SI) components, which contributes to better OOD generalization by involving minimal change properties of causal mechanisms. Specifically, we first formulate the data generation process from the atom level to the molecular level, where the latent space is split into SI substructures, SR substructures, and SR atom variables. Sequentially, to reduce misidentification, we restrict the minimal changes of the SR atom variables and add a semantic latent substructure regularization to mitigate the variance of the SR substructure under augmented domain changes. Under mild assumptions, we prove the block-wise identifiability of the SR substructure and the comment-wise identifiability of SR atom variables. Experimental studies achieve state-of-the-art performance and show general improvement on 21 datasets in 3 mainstream benchmarks. Moreover, the visualization results of the proposed SCI method provide insightful case studies and explanations for the prediction results. The code is available at: https://github.com/DMIRLAB-Group/SCI.
    Evaluating Uncertainty Quantification approaches for Neural PDEs in scientific applications. (arXiv:2311.04457v1 [cs.LG])
    The accessibility of spatially distributed data, enabled by affordable sensors, field, and numerical experiments, has facilitated the development of data-driven solutions for scientific problems, including climate change, weather prediction, and urban planning. Neural Partial Differential Equations (Neural PDEs), which combine deep learning (DL) techniques with domain expertise (e.g., governing equations) for parameterization, have proven to be effective in capturing valuable correlations within spatiotemporal datasets. However, sparse and noisy measurements coupled with modeling approximation introduce aleatoric and epistemic uncertainties. Therefore, quantifying uncertainties propagated from model inputs to outputs remains a challenge and an essential goal for establishing the trustworthiness of Neural PDEs. This work evaluates various Uncertainty Quantification (UQ) approaches for both Forward and Inverse Problems in scientific applications. Specifically, we investigate the effectiveness of Bayesian methods, such as Hamiltonian Monte Carlo (HMC) and Monte-Carlo Dropout (MCD), and a more conventional approach, Deep Ensembles (DE). To illustrate their performance, we take two canonical PDEs: Burger's equation and the Navier-Stokes equation. Our results indicate that Neural PDEs can effectively reconstruct flow systems and predict the associated unknown parameters. However, it is noteworthy that the results derived from Bayesian methods, based on our observations, tend to display a higher degree of certainty in their predictions as compared to those obtained using the DE. This elevated certainty in predictions suggests that Bayesian techniques might underestimate the true underlying uncertainty, thereby appearing more confident in their predictions than the DE approach.
    Spectral Evolution and Invariance in Linear-width Neural Networks. (arXiv:2211.06506v2 [cs.LG] UPDATED)
    We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to network width. Empirically, we show that the spectra of weight in this high dimensional regime are invariant when trained by gradient descent for small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy tail behavior. Simple examples are provided to explain when heavy tails can have better generalizations. We exhibit different spectral properties such as invariant bulk, spike, and heavy-tailed distribution from a two-layer neural network using different training strategies, and then correlate them to the feature learning. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning.
    Accurate Autism Spectrum Disorder prediction using Support Vector Classifier based on Federated Learning (SVCFL). (arXiv:2311.04606v1 [cs.LG])
    The path to an autism diagnosis can be long and difficult, and delays can have serious consequences. Artificial intelligence can completely change the way autism is diagnosed, especially when it comes to situations where it is difficult to see the first signs of the disease. AI-based diagnostic tools may help confirm a diagnosis or highlight the need for further testing by analyzing large volumes of data and uncovering patterns that may not be immediately apparent to human evaluators. After a successful and timely diagnosis, autism can be treated through artificial intelligence using various methods. In this article, by using four datasets and gathering them with the federated learning method and diagnosing them with the support vector classifier method, the early diagnosis of this disorder has been discussed. In this method, we have achieved 99% accuracy for predicting autism spectrum disorder and we have achieved 13% improvement in the results.
    Long-term Time Series Forecasting based on Decomposition and Neural Ordinary Differential Equations. (arXiv:2311.04522v1 [cs.LG])
    Long-term time series forecasting (LTSF) is a challenging task that has been investigated in various domains such as finance investment, health care, traffic, and weather forecasting. In recent years, Linear-based LTSF models showed better performance, pointing out the problem of Transformer-based approaches causing temporal information loss. However, Linear-based approach has also limitations that the model is too simple to comprehensively exploit the characteristics of the dataset. To solve these limitations, we propose LTSF-DNODE, which applies a model based on linear ordinary differential equations (ODEs) and a time series decomposition method according to data statistical characteristics. We show that LTSF-DNODE outperforms the baselines on various real-world datasets. In addition, for each dataset, we explore the impacts of regularization in the neural ordinary differential equation (NODE) framework.
    TopP&R: Robust Support Estimation Approach for Evaluating Fidelity and Diversity in Generative Models. (arXiv:2306.08013v4 [cs.LG] UPDATED)
    We propose a robust and reliable evaluation metric for generative models by introducing topological and statistical treatments for rigorous support estimation. Existing metrics, such as Inception Score (IS), Frechet Inception Distance (FID), and the variants of Precision and Recall (P&R), heavily rely on supports that are estimated from sample features. However, the reliability of their estimation has not been seriously discussed (and overlooked) even though the quality of the evaluation entirely depends on it. In this paper, we propose Topological Precision and Recall (TopP&R, pronounced 'topper'), which provides a systematic approach to estimating supports, retaining only topologically and statistically important features with a certain level of confidence. This not only makes TopP&R strong for noisy features, but also provides statistical consistency. Our theoretical and experimental results show that TopP&R is robust to outliers and non-independent and identically distributed (Non-IID) perturbations, while accurately capturing the true trend of change in samples. To the best of our knowledge, this is the first evaluation metric focused on the robust estimation of the support and provides its statistical consistency under noise.
    Toward Rapid, Optimal, and Feasible Power Dispatch through Generalized Neural Mapping. (arXiv:2311.04838v1 [eess.SY])
    The evolution towards a more distributed and interconnected grid necessitates large-scale decision-making within strict temporal constraints. Machine learning (ML) paradigms have demonstrated significant potential in improving the efficacy of optimization processes. However, the feasibility of solutions derived from ML models continues to pose challenges. It's imperative that ML models produce solutions that are attainable and realistic within the given system constraints of power systems. To address the feasibility issue and expedite the solution search process, we proposed LOOP-LC 2.0(Learning to Optimize the Optimization Process with Linear Constraints version 2.0) as a learning-based approach for solving the power dispatch problem. A notable advantage of the LOOP-LC 2.0 framework is its ability to ensure near-optimality and strict feasibility of solutions without depending on computationally intensive post-processing procedures, thus eliminating the need for iterative processes. At the heart of the LOOP-LC 2.0 model lies the newly proposed generalized gauge map method, capable of mapping any infeasible solution to a feasible point within the linearly-constrained domain. The proposed generalized gauge map method improves the traditional gauge map by exhibiting reduced sensitivity to input variances while increasing search speeds significantly. Utilizing the IEEE-200 test case as a benchmark, we demonstrate the effectiveness of the LOOP-LC 2.0 methodology, confirming its superior performance in terms of training speed, computational time, optimality, and solution feasibility compared to existing methodologies.
    Joint control variate for faster black-box variational inference. (arXiv:2210.07290v3 [cs.LG] UPDATED)
    Black-box variational inference performance is sometimes hindered by the use of gradient estimators with high variance. This variance comes from two sources of randomness: Data subsampling and Monte Carlo sampling. While existing control variates only address Monte Carlo noise, and incremental gradient methods typically only address data subsampling, we propose a new "joint" control variate that jointly reduces variance from both sources of noise. This significantly reduces gradient variance, leading to faster optimization in several applications.
    Future Lens: Anticipating Subsequent Tokens from a Single Hidden State. (arXiv:2311.04897v1 [cs.CL])
    We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear approximation and causal intervention methods in GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs. We find that, at some layers, we can approximate a model's output with more than 48% accuracy with respect to its prediction of subsequent tokens through a single hidden state. Finally we present a "Future Lens" visualization that uses these methods to create a new view of transformer states.
    Variational Classification. (arXiv:2305.10406v3 [cs.LG] UPDATED)
    We present a latent variable model for classification that provides a novel probabilistic interpretation of neural network softmax classifiers. We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders, that generalises the cross-entropy loss used to train classification models. Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency between their anticipated distribution, required for accurate label predictions to be output, and the empirical distribution found in practice. We augment the variational objective to mitigate such inconsistency and encourage a chosen latent distribution, instead of the implicit assumption in off-the-shelf softmax classifiers. Overall, we provide new theoretical insight into the inner workings of widely-used softmax classification. Empirical evaluation on image and text classification datasets demonstrates that our proposed approach, variational classification, maintains classification accuracy while the reshaped latent space improves other desirable properties of a classifier, such as calibration, adversarial robustness, robustness to distribution shift and sample efficiency useful in low data settings.
    HKTGNN: Hierarchical Knowledge Transferable Graph Neural Network-based Supply Chain Risk Assessment. (arXiv:2311.04244v1 [cs.LG])
    The strength of a supply chain is an important measure of a country's or region's technical advancement and overall competitiveness. Establishing supply chain risk assessment models for effective management and mitigation of potential risks has become increasingly crucial. As the number of businesses grows, the important relationships become more complicated and difficult to measure. This emphasizes the need of extracting relevant information from graph data. Previously, academics mostly employed knowledge inference to increase the visibility of links between nodes in the supply chain. However, they have not solved the data hunger problem of single node feature characteristics. We propose a hierarchical knowledge transferable graph neural network-based (HKTGNN) supply chain risk assessment model to address these issues. Our approach is based on current graph embedding methods for assessing corporate investment risk assessment. We embed the supply chain network corresponding to individual goods in the supply chain using the graph embedding module, resulting in a directed homogeneous graph with just product nodes. This reduces the complicated supply chain network into a basic product network. It addresses difficulties using the domain difference knowledge transferable module based on centrality, which is presented by the premise that supply chain feature characteristics may be biased in the actual world. Meanwhile, the feature complement and message passing will alleviate the data hunger problem, which is driven by domain differences. Our model outperforms in experiments on a real-world supply chain dataset. We will give an equation to prove that our comparative experiment is both effective and fair.
    Incorporating temporal dynamics of mutations to enhance the prediction capability of antiretroviral therapy's outcome for HIV-1. (arXiv:2311.04846v1 [cs.LG])
    Motivation: In predicting HIV therapy outcomes, a critical clinical question is whether using historical information can enhance predictive capabilities compared with current or latest available data analysis. This study analyses whether historical knowledge, which includes viral mutations detected in all genotypic tests before therapy, their temporal occurrence, and concomitant viral load measurements, can bring improvements. We introduce a method to weigh mutations, considering the previously enumerated factors and the reference mutation-drug Stanford resistance tables. We compare a model encompassing history (H) with one not using it (NH). Results: The H-model demonstrates superior discriminative ability, with a higher ROC-AUC score (76.34%) than the NH-model (74.98%). Significant Wilcoxon test results confirm that incorporating historical information improves consistently predictive accuracy for treatment outcomes. The better performance of the H-model might be attributed to its consideration of latent HIV reservoirs, probably obtained when leveraging historical information. The findings emphasize the importance of temporal dynamics in mutations, offering insights into HIV infection complexities. However, our result also shows that prediction accuracy remains relatively high even when no historical information is available. Supplementary information: Supplementary material is available.
    Challenging Common Assumptions in Multi-task Learning. (arXiv:2311.04698v1 [cs.LG])
    While multi-task learning (MTL) has gained significant attention in recent years, its underlying mechanisms remain poorly understood. Recent methods did not yield consistent performance improvements over single task learning (STL) baselines, underscoring the importance of gaining more profound insights about challenges specific to MTL. In our study, we challenge common assumptions in MTL in the context of STL: First, the choice of optimizer has only been mildly investigated in MTL. We show the pivotal role of common STL tools such as the Adam optimizer in MTL. We deduce the effectiveness of Adam to its partial loss-scale invariance. Second, the notion of gradient conflicts has often been phrased as a specific problem in MTL. We delve into the role of gradient conflicts in MTL and compare it to STL. For angular gradient alignment we find no evidence that this is a unique problem in MTL. We emphasize differences in gradient magnitude as the main distinguishing factor. Lastly, we compare the transferability of features learned through MTL and STL on common image corruptions, and find no conclusive evidence that MTL leads to superior transferability. Overall, we find surprising similarities between STL and MTL suggesting to consider methods from both fields in a broader context.
    Training CLIP models on Data from Scientific Papers. (arXiv:2311.04711v1 [cs.CV])
    Contrastive Language-Image Pretraining (CLIP) models are able to capture the semantic relationship of images and texts and have enabled a wide range of applications, from image retrieval to classification. These models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores whether limited amounts higher quality data in a specific domain improve the general performance of CLIP models. To this purpose, we extract text-image data from scientific papers hosted in the arXiv and PubMed Central repositories. Experiments on small-scale CLIP models (ViT B/32) show that model performance increases on average, but only moderately. This result indicates that using the data sources considered in the paper to train large-scale CLIP models is a worthwile research direction.
    Massive Editing for Large Language Models via Meta Learning. (arXiv:2311.04661v1 [cs.CL])
    While large language models (LLMs) have enabled learning knowledge from the pre-training corpora, the acquired knowledge may be fundamentally incorrect or outdated over time, which necessitates rectifying the knowledge of the language model (LM) after the training. A promising approach involves employing a hyper-network to generate parameter shift, whereas existing hyper-networks suffer from inferior scalability in synchronous editing operation amount. To mitigate the problem, we propose the MAssive Language Model Editing Network (MALMEN), which formulates the parameter shift aggregation as the least square problem, subsequently updating the LM parameters using the normal equation. To accommodate editing multiple facts simultaneously with limited memory budgets, we separate the computation on the hyper-network and LM, enabling arbitrary batch size on both neural networks. Our method is evaluated by editing up to thousands of facts on LMs with different architectures, i.e., BERT-base, GPT-2, T5-XL (2.8B), and GPT-J (6B), across various knowledge-intensive NLP tasks, i.e., closed book fact-checking and question answering. Remarkably, MALMEN is capable of editing hundreds of times more facts than strong baselines with the identical hyper-network architecture and outperforms editor specifically designed for GPT. Our code is available at https://github.com/ChenmienTan/malmen.
    Investigating Navigation Strategies in the Morris Water Maze through Deep Reinforcement Learning. (arXiv:2306.01066v2 [cs.LG] UPDATED)
    Navigation is a complex skill with a long history of research in animals and humans. In this work, we simulate the Morris Water Maze in 2D to train deep reinforcement learning agents. We perform automatic classification of navigation strategies, analyze the distribution of strategies used by artificial agents, and compare them with experimental data to show similar learning dynamics as those seen in humans and rodents. We develop environment-specific auxiliary tasks and examine factors affecting their usefulness. We suggest that the most beneficial tasks are potentially more biologically feasible for real agents to use. Lastly, we explore the development of internal representations in the activations of artificial agent neural networks. These representations resemble place cells and head-direction cells found in mouse brains, and their presence has correlation to the navigation strategies that artificial agents employ.
    Policy Space Diversity for Non-Transitive Games. (arXiv:2306.16884v2 [cs.GT] UPDATED)
    Policy-Space Response Oracles (PSRO) is an influential algorithm framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have been trying to promote policy diversity in PSRO. A major weakness in existing diversity metrics is that a more diverse (according to their diversity metrics) population does not necessarily mean (as we proved in the paper) a better approximation to a NE. To alleviate this problem, we propose a new diversity metric, the improvement of which guarantees a better approximation to a NE. Meanwhile, we develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best response solving in PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on various games demonstrate that PSD-PSRO is more effective in producing significantly less exploitable policies than state-of-the-art PSRO variants.
    Optimized measurements of chaotic dynamical systems via the information bottleneck. (arXiv:2311.04896v1 [cs.LG])
    Deterministic chaos permits a precise notion of a "perfect measurement" as one that, when obtained repeatedly, captures all of the information created by the system's evolution with minimal redundancy. Finding an optimal measurement is challenging, and has generally required intimate knowledge of the dynamics in the few cases where it has been done. We establish an equivalence between a perfect measurement and a variant of the information bottleneck. As a consequence, we can employ machine learning to optimize measurement processes that efficiently extract information from trajectory data. We obtain approximately optimal measurements for multiple chaotic maps and lay the necessary groundwork for efficient information extraction from general time series.
    Euclidean, Projective, Conformal: Choosing a Geometric Algebra for Equivariant Transformers. (arXiv:2311.04744v1 [cs.LG])
    The Geometric Algebra Transformer (GATr) is a versatile architecture for geometric deep learning based on projective geometric algebra. We generalize this architecture into a blueprint that allows one to construct a scalable transformer architecture given any geometric (or Clifford) algebra. We study versions of this architecture for Euclidean, projective, and conformal algebras, all of which are suited to represent 3D data, and evaluate them in theory and practice. The simplest Euclidean architecture is computationally cheap, but has a smaller symmetry group and is not as sample-efficient, while the projective model is not sufficiently expressive. Both the conformal algebra and an improved version of the projective algebra define powerful, performant architectures.
    MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation. (arXiv:2305.15296v2 [cs.CV] UPDATED)
    The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion that allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MutliFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.
    Debiasing Scores and Prompts of 2D Diffusion for View-consistent Text-to-3D Generation. (arXiv:2303.15413v4 [cs.CV] UPDATED)
    Existing score-distilling text-to-3D generation techniques, despite their considerable promise, often encounter the view inconsistency problem. One of the most notable issues is the Janus problem, where the most canonical view of an object (\textit{e.g}., face or head) appears in other views. In this work, we explore existing frameworks for score-distilling text-to-3D generation and identify the main causes of the view inconsistency problem -- the embedded bias of 2D diffusion models. Based on these findings, we propose two approaches to debias the score-distillation frameworks for view-consistent text-to-3D generation. Our first approach, called score debiasing, involves cutting off the score estimated by 2D diffusion models and gradually increasing the truncation value throughout the optimization process. Our second approach, called prompt debiasing, identifies conflicting words between user prompts and view prompts using a language model, and adjusts the discrepancy between view prompts and the viewing direction of an object. Our experimental results show that our methods improve the realism of the generated 3D objects by significantly reducing artifacts and achieve a good trade-off between faithfulness to the 2D diffusion models and 3D consistency with little overhead. Our project page is available at~\url{https://susunghong.github.io/Debiased-Score-Distillation-Sampling/}.
    Data Factors for Better Compositional Generalization. (arXiv:2311.04420v1 [cs.CL])
    Recent diagnostic datasets on compositional generalization, such as SCAN (Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), expose severe problems in models trained from scratch on these datasets. However, in contrast to this poor performance, state-of-the-art models trained on larger and more general datasets show better generalization ability. In this work, to reconcile this inconsistency, we conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors, including dataset scale, pattern complexity, example difficulty, etc. First, we show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges. To further understand this improvement, we show two axes of the benefit from more complex datasets: they provide more diverse examples so compositional understanding becomes more effective, and they also prevent ungeneralizable memorization of the examples due to reduced example repetition frequency. Finally, we explore how training examples of different difficulty levels influence generalization differently. On synthetic datasets, simple examples invoke stronger compositionality than hard examples do. On larger-scale real language datasets, while hard examples become more important potentially to ensure decent data coverage, a balanced mixture of simple and hard examples manages to induce the strongest generalizability. The code and data for this work are available at https://github.com/owenzx/data4comp
    Evading Watermark based Detection of AI-Generated Content. (arXiv:2305.03807v5 [cs.LG] UPDATED)
    A generative AI model can generate extremely realistic-looking content, posing growing challenges to the authenticity of information. To address the challenges, watermark has been leveraged to detect AI-generated content. Specifically, a watermark is embedded into an AI-generated content before it is released. A content is detected as AI-generated if a similar watermark can be decoded from it. In this work, we perform a systematic study on the robustness of such watermark-based AI-generated content detection. We focus on AI-generated images. Our work shows that an attacker can post-process a watermarked image via adding a small, human-imperceptible perturbation to it, such that the post-processed image evades detection while maintaining its visual quality. We show the effectiveness of our attack both theoretically and empirically. Moreover, to evade detection, our adversarial post-processing method adds much smaller perturbations to AI-generated images and thus better maintain their visual quality than existing popular post-processing methods such as JPEG compression, Gaussian blur, and Brightness/Contrast. Our work shows the insufficiency of existing watermark-based detection of AI-generated content, highlighting the urgent needs of new methods. Our code is publicly available: https://github.com/zhengyuan-jiang/WEvade.
    ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations. (arXiv:2311.04262v1 [cs.CV])
    Electronic theses and dissertations (ETDs) have been proposed, advocated, and generated for more than 25 years. Although ETDs are hosted by commercial or institutional digital library repositories, they are still an understudied type of scholarly big data, partially because they are usually longer than conference proceedings and journals. Segmenting ETDs will allow researchers to study sectional content. Readers can navigate to particular pages of interest, discover, and explore the content buried in these long documents. Most existing frameworks on document page classification are designed for classifying general documents and perform poorly on ETDs. In this paper, we propose ETDPC. Its backbone is a two-stream multimodal model with a cross-attention network to classify ETD pages into 13 categories. To overcome the challenge of imbalanced labeled samples, we augmented data for minority categories and employed a hierarchical classifier. ETDPC outperforms the state-of-the-art models in all categories, achieving an F1 of 0.84 -- 0.96 for 9 out of 13 categories. We also demonstrated its data efficiency. The code and data can be found on GitHub (https://github.com/lamps-lab/ETDMiner/tree/master/etd_segmentation).
    Interpretability, Generalizability, and Memory of Reinforcement Learning Agents in Closed Drafting Games. (arXiv:2310.20654v2 [cs.LG] UPDATED)
    Closed drafting or "pick and pass" is a popular game mechanic where each round players select a card or other playable element from their hand and pass the rest to the next player. In this paper, we establish first-principle interpretability, generalizability, and memory benchmarks for studying model-free reinforcement learning (RL) algorithms playing closed drafting games. Specifically in a popular family of closed drafting games called "Sushi Go Party!", in which we achieve state-of-the-art performance. We fit decision rules to interpret the strategy of trained RL agents and compare these to the ranking preferences of different types of human players, finding easily understandable explanations of the disparate performance of RL agents in this environment. As Sushi Go Party! can be expressed as a set of closely-related games based on the set of cards in play, we quantify the generalizability of RL models trained on various sets of cards, establishing key trends between performance and the set distance between the train and evaluation game configurations. Using the explicitly calculable memory of other player's hands in closed drafting games, we create measures of the ability of RL models to learn memory.
    RoFormer: Enhanced Transformer with Rotary Position Embedding. (arXiv:2104.09864v5 [cs.CL] UPDATED)
    Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently overcomes its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: \url{https://huggingface.co/docs/transformers/model_doc/roformer}.
    Deep Learning Assisted Multiuser MIMO Load Modulated Systems for Enhanced Downlink mmWave Communications. (arXiv:2311.04537v1 [eess.SP])
    This paper is focused on multiuser load modulation arrays (MU-LMAs) which are attractive due to their low system complexity and reduced cost for millimeter wave (mmWave) multi-input multi-output (MIMO) systems. The existing precoding algorithm for downlink MU-LMA relies on a sub-array structured (SAS) transmitter which may suffer from decreased degrees of freedom and complex system configuration. Furthermore, a conventional LMA codebook with codewords uniformly distributed on a hypersphere may not be channel-adaptive and may lead to increased signal detection complexity. In this paper, we conceive an MU-LMA system employing a full-array structured (FAS) transmitter and propose two algorithms accordingly. The proposed FAS-based system addresses the SAS structural problems and can support larger numbers of users. For LMA-imposed constant-power downlink precoding, we propose an FAS-based normalized block diagonalization (FAS-NBD) algorithm. However, the forced normalization may result in performance degradation. This degradation, together with the aforementioned codebook design problems, is difficult to solve analytically. This motivates us to propose a Deep Learning-enhanced (FAS-DL-NBD) algorithm for adaptive codebook design and codebook-independent decoding. It is shown that the proposed algorithms are robust to imperfect knowledge of channel state information and yield excellent error performance. Moreover, the FAS-DL-NBD algorithm enables signal detection with low complexity as the number of bits per codeword increases.
    A Hierarchical Spatial Transformer for Massive Point Samples in Continuous Space. (arXiv:2311.04434v1 [cs.LG])
    Transformers are widely used deep learning architectures. Existing transformers are mostly designed for sequences (texts or time series), images or videos, and graphs. This paper proposes a novel transformer model for massive (up to a million) point samples in continuous space. Such data are ubiquitous in environment sciences (e.g., sensor observations), numerical simulations (e.g., particle-laden flow, astrophysics), and location-based services (e.g., POIs and trajectories). However, designing a transformer for massive spatial points is non-trivial due to several challenges, including implicit long-range and multi-scale dependency on irregular points in continuous space, a non-uniform point distribution, the potential high computational costs of calculating all-pair attention across massive points, and the risks of over-confident predictions due to varying point density. To address these challenges, we propose a new hierarchical spatial transformer model, which includes multi-resolution representation learning within a quad-tree hierarchy and efficient spatial attention via coarse approximation. We also design an uncertainty quantification branch to estimate prediction confidence related to input feature noise and point sparsity. We provide a theoretical analysis of computational time complexity and memory costs. Extensive experiments on both real-world and synthetic datasets show that our method outperforms multiple baselines in prediction accuracy and our model can scale up to one million points on one NVIDIA A100 GPU. The code is available at \url{https://github.com/spatialdatasciencegroup/HST}.  ( 2 min )
    InstrumentGen: Generating Sample-Based Musical Instruments From Text. (arXiv:2311.04339v1 [eess.AS])
    We introduce the text-to-instrument task, which aims at generating sample-based musical instruments based on textual prompts. Accordingly, we propose InstrumentGen, a model that extends a text-prompted generative audio framework to condition on instrument family, source type, pitch (across an 88-key spectrum), velocity, and a joint text/audio embedding. Furthermore, we present a differentiable loss function to evaluate the intra-instrument timbral consistency of sample-based instruments. Our results establish a foundational text-to-instrument baseline, extending research in the domain of automatic sample-based instrument generation.  ( 2 min )
    SaFL: Sybil-aware Federated Learning with Application to Face Recognition. (arXiv:2311.04346v1 [cs.CV])
    Federated Learning (FL) is a machine learning paradigm to conduct collaborative learning among clients on a joint model. The primary goal is to share clients' local training parameters with an integrating server while preserving their privacy. This method permits to exploit the potential of massive mobile users' data for the benefit of machine learning models' performance while keeping sensitive data on local devices. On the downside, FL raises security and privacy concerns that have just started to be studied. To address some of the key threats in FL, researchers have proposed to use secure aggregation methods (e.g. homomorphic encryption, secure multiparty computation, etc.). These solutions improve some security and privacy metrics, but at the same time bring about other serious threats such as poisoning attacks, backdoor attacks, and free running attacks. This paper proposes a new defense method against poisoning attacks in FL called SaFL (Sybil-aware Federated Learning) that minimizes the effect of sybils with a novel time-variant aggregation scheme.  ( 2 min )
    Class-Incremental Continual Learning for General Purpose Healthcare Models. (arXiv:2311.04301v1 [cs.LG])
    Healthcare clinics regularly encounter dynamic data that changes due to variations in patient populations, treatment policies, medical devices, and emerging disease patterns. Deep learning models can suffer from catastrophic forgetting when fine-tuned in such scenarios, causing poor performance on previously learned tasks. Continual learning allows learning on new tasks without performance drop on previous tasks. In this work, we investigate the performance of continual learning models on four different medical imaging scenarios involving ten classification datasets from diverse modalities, clinical specialties, and hospitals. We implement various continual learning approaches and evaluate their performance in these scenarios. Our results demonstrate that a single model can sequentially learn new tasks from different specialties and achieve comparable performance to naive methods. These findings indicate the feasibility of recycling or sharing models across the same or different medical specialties, offering another step towards the development of general-purpose medical imaging AI that can be shared across institutions.  ( 2 min )
    Holistic Evaluation of Text-To-Image Models. (arXiv:2311.04287v1 [cs.CV])
    The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at https://crfm.stanford.edu/heim/v1.1.0 and the code at https://github.com/stanford-crfm/helm, which is integrated with the HELM codebase.  ( 2 min )
    Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation. (arXiv:2311.04254v1 [cs.AI])
    Recent advancements in Large Language Models (LLMs) have revolutionized decision-making by breaking down complex problems into more manageable language sequences referred to as ``thoughts''. An effective thought design should consider three key perspectives: performance, efficiency, and flexibility. However, existing thought can at most exhibit two of these attributes. To address these limitations, we introduce a novel thought prompting approach called ``Everything of Thoughts'' (XoT) to defy the law of ``Penrose triangle of existing thought paradigms. XoT leverages pretrained reinforcement learning and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge into thoughts, thereby enhancing LLMs' capabilities and enabling them to generalize to unseen problems efficiently. Through the utilization of the MCTS-LLM collaborative thought revision framework, this approach autonomously produces high-quality comprehensive cognitive mappings with minimal LLM interactions. Additionally, XoT empowers LLMs to engage in unconstrained thinking, allowing for flexible cognitive mappings for problems with multiple solutions.  ( 2 min )
    MixtureGrowth: Growing Neural Networks by Recombining Learned Parameters. (arXiv:2311.04251v1 [cs.LG])
    Most deep neural networks are trained under fixed network architectures and require retraining when the architecture changes. If expanding the network's size is needed, it is necessary to retrain from scratch, which is expensive. To avoid this, one can grow from a small network by adding random weights over time to gradually achieve the target network size. However, this naive approach falls short in practice as it brings too much noise to the growing process. Prior work tackled this issue by leveraging the already learned weights and training data for generating new weights through conducting a computationally expensive analysis step. In this paper, we introduce MixtureGrowth, a new approach to growing networks that circumvents the initialization overhead in prior work. Before growing, each layer in our model is generated with a linear combination of parameter templates. Newly grown layer weights are generated by using a new linear combination of existing templates for a layer. On one hand, these templates are already trained for the task, providing a strong initialization. On the other, the new coefficients provide flexibility for the added layer weights to learn something new. We show that our approach boosts top-1 accuracy over the state-of-the-art by 2-2.5% on CIFAR-100 and ImageNet datasets, while achieving comparable performance with fewer FLOPs to a larger network trained from scratch. Code is available at https://github.com/chaudatascience/mixturegrowth.  ( 3 min )
    Zeroth-order Asynchronous Learning with Bounded Delays with a Use-case in Resource Allocation in Communication Networks. (arXiv:2311.04604v1 [eess.SP])
    Distributed optimization has experienced a significant surge in interest due to its wide-ranging applications in distributed learning and adaptation. While various scenarios, such as shared-memory, local-memory, and consensus-based approaches, have been extensively studied in isolation, there remains a need for further exploration of their interconnections. This paper specifically concentrates on a scenario where agents collaborate toward a unified mission while potentially having distinct tasks. Each agent's actions can potentially impact other agents through interactions. Within this context, the objective for the agents is to optimize their local parameters based on the aggregate of local reward functions, where only local zeroth-order oracles are available. Notably, the learning process is asynchronous, meaning that agents update and query their zeroth-order oracles asynchronously while communicating with other agents subject to bounded but possibly random communication delays. This paper presents theoretical convergence analyses and establishes a convergence rate for the proposed approach. Furthermore, it addresses the relevant issue of deep learning-based resource allocation in communication networks and conducts numerical experiments in which agents, acting as transmitters, collaboratively train their individual (possibly unique) policies to maximize a common performance metric.  ( 2 min )
    Adaptive Mirror Descent Bilevel Optimization. (arXiv:2311.04520v1 [math.OC])
    In the paper, we propose a class of efficient adaptive bilevel methods based on mirror descent for nonconvex bilevel optimization, where its upper-level problem is nonconvex possibly with nonsmooth regularization, and its lower-level problem is also nonconvex while satisfies Polyak-{\L}ojasiewicz (PL) condition. To solve these deterministic bilevel problems, we present an efficient adaptive projection-aid gradient (i.e., AdaPAG) method based on mirror descent, and prove that it obtains the best known gradient complexity of $O(\epsilon^{-1})$ for finding an $\epsilon$-stationary solution of nonconvex bilevel problems. To solve these stochastic bilevel problems, we propose an efficient adaptive stochastic projection-aid gradient (i.e., AdaVSPAG) methods based on mirror descent and variance-reduced techniques, and prove that it obtains the best known gradient complexity of $O(\epsilon^{-3/2})$ for finding an $\epsilon$-stationary solution. Since the PL condition relaxes the strongly convex, our algorithms can be used to nonconvex strongly-convex bilevel optimization. Theoretically, we provide a useful convergence analysis framework for our methods under some mild conditions, and prove that our methods have a fast convergence rate of $O(\frac{1}{T})$, where $T$ denotes the number of iterations.  ( 2 min )
    Compilation of product-formula Hamiltonian simulation via reinforcement learning. (arXiv:2311.04285v1 [quant-ph])
    Hamiltonian simulation is believed to be one of the first tasks where quantum computers can yield a quantum advantage. One of the most popular methods of Hamiltonian simulation is Trotterization, which makes use of the approximation $e^{i\sum_jA_j}\sim \prod_je^{iA_j}$ and higher-order corrections thereto. However, this leaves open the question of the order of operations (i.e. the order of the product over $j$, which is known to affect the quality of approximation). In some cases this order is fixed by the desire to minimise the error of approximation; when it is not the case, we propose that the order can be chosen to optimize compilation to a native quantum architecture. This presents a new compilation problem -- order-agnostic quantum circuit compilation -- which we prove is NP-hard in the worst case. In lieu of an easily-computable exact solution, we turn to methods of heuristic optimization of compilation. We focus on reinforcement learning due to the sequential nature of the compilation task, comparing it to simulated annealing and Monte Carlo tree search. While two of the methods outperform a naive heuristic, reinforcement learning clearly outperforms all others, with a gain of around 12% with respect to the second-best method and of around 50% compared to the naive heuristic in terms of the gate count. We further test the ability of RL to generalize across instances of the compilation problem, and find that a single learner is able to solve entire problem families. This demonstrates the ability of machine learning techniques to provide assistance in an order-agnostic quantum compilation task.  ( 3 min )
    Towards Democratizing AI: A Comparative Analysis of AI as a Service Platforms and the Open Space for Machine Learning Approach. (arXiv:2311.04518v1 [cs.LG])
    Recent AI research has significantly reduced the barriers to apply AI, but the process of setting up the necessary tools and frameworks can still be a challenge. While AI-as-a-Service platforms have emerged to simplify the training and deployment of AI models, they still fall short of achieving true democratization of AI. In this paper, we aim to address this gap by comparing several popular AI-as-a-Service platforms and identifying the key requirements for a platform that can achieve true democratization of AI. Our analysis highlights the need for self-hosting options, high scalability, and openness. To address these requirements, we propose our approach: the "Open Space for Machine Learning" platform. Our platform is built on cutting-edge technologies such as Kubernetes, Kubeflow Pipelines, and Ludwig, enabling us to overcome the challenges of democratizing AI. We argue that our approach is more comprehensive and effective in meeting the requirements of democratizing AI than existing AI-as-a-Service platforms.  ( 2 min )
    CNN-Based Structural Damage Detection using Time-Series Sensor Data. (arXiv:2311.04252v1 [cs.LG])
    Structural Health Monitoring (SHM) is vital for evaluating structural condition, aiming to detect damage through sensor data analysis. It aligns with predictive maintenance in modern industry, minimizing downtime and costs by addressing potential structural issues. Various machine learning techniques have been used to extract valuable information from vibration data, often relying on prior structural knowledge. This research introduces an innovative approach to structural damage detection, utilizing a new Convolutional Neural Network (CNN) algorithm. In order to extract deep spatial features from time series data, CNNs are taught to recognize long-term temporal connections. This methodology combines spatial and temporal features, enhancing discrimination capabilities when compared to methods solely reliant on deep spatial features. Time series data are divided into two categories using the proposed neural network: undamaged and damaged. To validate its efficacy, the method's accuracy was tested using a benchmark dataset derived from a three-floor structure at Los Alamos National Laboratory (LANL). The outcomes show that the new CNN algorithm is very accurate in spotting structural degradation in the examined structure.  ( 2 min )
    Regression with Cost-based Rejection. (arXiv:2311.04550v1 [cs.LG])
    Learning with rejection is an important framework that can refrain from making predictions to avoid critical mispredictions by balancing between prediction and rejection. Previous studies on cost-based rejection only focused on the classification setting, which cannot handle the continuous and infinite target space in the regression setting. In this paper, we investigate a novel regression problem called regression with cost-based rejection, where the model can reject to make predictions on some examples given certain rejection costs. To solve this problem, we first formulate the expected risk for this problem and then derive the Bayes optimal solution, which shows that the optimal model should reject to make predictions on the examples whose variance is larger than the rejection cost when the mean squared error is used as the evaluation metric. Furthermore, we propose to train the model by a surrogate loss function that considers rejection as binary classification and we provide conditions for the model consistency, which implies that the Bayes optimal solution can be recovered by our proposed surrogate loss. Extensive experiments demonstrate the effectiveness of our proposed method.  ( 2 min )
    Graph Neural Networks for Topological Feature Extraction in ECG Classification. (arXiv:2311.04228v1 [eess.SP])
    The electrocardiogram (ECG) is a dependable instrument for assessing the function of the cardiovascular system. There has recently been much emphasis on precisely classifying ECGs. While ECG situations have numerous similarities, little attention has been paid to categorizing ECGs using graph neural networks. In this study, we offer three distinct techniques for classifying heartbeats using deep graph neural networks to classify the ECG signals accurately. We suggest using different methods to extract topological features from the ECG signal and then using a branch of the graph neural network named graph isomorphism network for classifying the ECGs. On the PTB Diagnostics data set, we tested the three proposed techniques. According to the findings, the three proposed techniques are capable of making arrhythmia classification predictions with the accuracy of 99.38, 98.76, and 91.93 percent, respectively.  ( 2 min )
    Compressive Recovery of Sparse Precision Matrices. (arXiv:2311.04673v1 [stat.ML])
    We consider the problem of learning a graph modeling the statistical relations of the $d$ variables of a dataset with $n$ samples $X \in \mathbb{R}^{n \times d}$. Standard approaches amount to searching for a precision matrix $\Theta$ representative of a Gaussian graphical model that adequately explains the data. However, most maximum likelihood-based estimators usually require storing the $d^{2}$ values of the empirical covariance matrix, which can become prohibitive in a high-dimensional setting. In this work, we adopt a compressive viewpoint and aim to estimate a sparse $\Theta$ from a sketch of the data, i.e. a low-dimensional vector of size $m \ll d^{2}$ carefully designed from $X$ using nonlinear random features. Under certain assumptions on the spectrum of $\Theta$ (or its condition number), we show that it is possible to estimate it from a sketch of size $m=\Omega((d+2k)\log(d))$ where $k$ is the maximal number of edges of the underlying graph. These information-theoretic guarantees are inspired by compressed sensing theory and involve restricted isometry properties and instance optimal decoders. We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser. We compare our approach and graphical lasso on synthetic datasets, demonstrating its favorable performance even when the dataset is compressed.  ( 2 min )
    Autonomous Advanced Aerial Mobility -- An End-to-end Autonomy Framework for UAVs and Beyond. (arXiv:2311.04472v1 [cs.RO])
    Developing aerial robots that can both safely navigate and execute assigned mission without any human intervention - i.e., fully autonomous aerial mobility of passengers and goods - is the larger vision that guides the research, design, and development efforts in the aerial autonomy space. However, it is highly challenging to concurrently operationalize all types of aerial vehicles that are operating fully autonomously sharing the airspace. Full autonomy of the aerial transportation sector includes several aspects, such as design of the technology that powers the vehicles, operations of multi-agent fleets, and process of certification that meets stringent safety requirements of aviation sector. Thereby, Autonomous Advanced Aerial Mobility is still a vague term and its consequences for researchers and professionals are ambiguous. To address this gap, we present a comprehensive perspective on the emerging field of autonomous advanced aerial mobility, which involves the use of unmanned aerial vehicles (UAVs) and electric vertical takeoff and landing (eVTOL) aircraft for various applications, such as urban air mobility, package delivery, and surveillance. The article proposes a scalable and extensible autonomy framework consisting of four main blocks: sensing, perception, planning, and controls. Furthermore, the article discusses the challenges and opportunities in multi-agent fleet operations and management, as well as the testing, validation, and certification aspects of autonomous aerial systems. Finally, the article explores the potential of monolithic models for aerial autonomy and analyzes their advantages and limitations. The perspective aims to provide a holistic picture of the autonomous advanced aerial mobility field and its future directions.  ( 3 min )
    Predicting Market Value in Professional Soccer: Insights from Explainable Machine Learning Models. (arXiv:2311.04599v1 [cs.LG])
    This study presents an innovative method for predicting the market value of professional soccer players using explainable machine learning models. Using a dataset curated from the FIFA website, we employ an ensemble machine learning approach coupled with Shapley Additive exPlanations (SHAP) to provide detailed explanations of the models' predictions. The GBDT model achieves the highest mean R-Squared (0.8780) and the lowest mean Root Mean Squared Error (3,221,632.175), indicating its superior performance among the evaluated models. Our analysis reveals that specific skills such as ball control, short passing, finishing, interceptions, dribbling, and tackling are paramount within the skill dimension, whereas sprint speed and acceleration are critical in the fitness dimension, and reactions are preeminent in the cognitive dimension. Our results offer a more accurate, objective, and consistent framework for market value estimation, presenting useful insights for managerial decisions in player transfers.  ( 2 min )
    Leveraging sinusoidal representation networks to predict fMRI signals from EEG. (arXiv:2311.04234v1 [eess.SP])
    In modern neuroscience, functional magnetic resonance imaging (fMRI) has been a crucial and irreplaceable tool that provides a non-invasive window into the dynamics of whole-brain activity. Nevertheless, fMRI is limited by hemodynamic blurring as well as high cost, immobility, and incompatibility with metal implants. Electroencephalography (EEG) is complementary to fMRI and can directly record the cortical electrical activity at high temporal resolution, but has more limited spatial resolution and is unable to recover information about deep subcortical brain structures. The ability to obtain fMRI information from EEG would enable cost-effective, imaging across a wider set of brain regions. Further, beyond augmenting the capabilities of EEG, cross-modality models would facilitate the interpretation of fMRI signals. However, as both EEG and fMRI are high-dimensional and prone to artifacts, it is currently challenging to model fMRI from EEG. To address this challenge, we propose a novel architecture that can predict fMRI signals directly from multi-channel EEG without explicit feature engineering. Our model achieves this by implementing a Sinusoidal Representation Network (SIREN) to learn frequency information in brain dynamics from EEG, which serves as the input to a subsequent encoder-decoder to effectively reconstruct the fMRI signal from a specific brain region. We evaluate our model using a simultaneous EEG-fMRI dataset with 8 subjects and investigate its potential for predicting subcortical fMRI signals. The present results reveal that our model outperforms a recent state-of-the-art model, and indicates the potential of leveraging periodic activation functions in deep neural networks to model functional neuroimaging data.  ( 3 min )
    Eigensubspace of Temporal-Difference Dynamics and How It Improves Value Approximation in Reinforcement Learning. (arXiv:2306.16750v2 [cs.LG] UPDATED)
    We propose a novel value approximation method, namely Eigensubspace Regularized Critic (ERC) for deep reinforcement learning (RL). ERC is motivated by an analysis of the dynamics of Q-value approximation error in the Temporal-Difference (TD) method, which follows a path defined by the 1-eigensubspace of the transition kernel associated with the Markov Decision Process (MDP). It reveals a fundamental property of TD learning that has remained unused in previous deep RL approaches. In ERC, we propose a regularizer that guides the approximation error tending towards the 1-eigensubspace, resulting in a more efficient and stable path of value approximation. Moreover, we theoretically prove the convergence of the ERC method. Besides, theoretical analysis and experiments demonstrate that ERC effectively reduces the variance of value functions. Among 26 tasks in the DMControl benchmark, ERC outperforms state-of-the-art methods for 20. Besides, it shows significant advantages in Q-value approximation and variance reduction. Our code is available at https://sites.google.com/view/erc-ecml23/.
    Robotic Learning the Sequence of Packing Irregular Objects from Human Demonstrations. (arXiv:2210.01645v2 [cs.RO] UPDATED)
    We tackle the challenge of robotic bin packing with irregular objects, such as groceries. Given the diverse physical attributes of these objects and the complex constraints governing their placement and manipulation, employing preprogrammed strategies becomes unfeasible. Our approach is to learn directly from expert demonstrations in order to extract implicit task knowledge and strategies to ensure safe object positioning, efficient use of space, and the generation of human-like behaviors that enhance human-robot trust. We rely on human demonstrations to learn a Markov chain for predicting the object packing sequence for a given set of items and then compare it with human performance. Our experimental results show that the model outperforms human performance by generating sequence predictions that humans classify as human-like more frequently than human-generated sequences. The human demonstrations were collected using our proposed VR platform, BoxED, which is a box packaging environment for simulating real-world objects and scenarios for fast and streamlined data collection with the purpose of teaching robots. We collected data from 43 participants packing a total of 263 boxes with supermarket-like objects, yielding 4644 object manipulations. Our VR platform can be easily adapted to new scenarios and objects, and is publicly available, alongside our dataset, at https://github.com/andrejfsantos4/BoxED.
    CLearViD: Curriculum Learning for Video Description. (arXiv:2311.04480v1 [cs.CV])
    Video description entails automatically generating coherent natural language sentences that narrate the content of a given video. We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning to accomplish this task. In particular, we investigate two curriculum strategies: (1) progressively exposing the model to more challenging samples by gradually applying a Gaussian noise to the video data, and (2) gradually reducing the capacity of the network through dropout during the training process. These methods enable the model to learn more robust and generalizable features. Moreover, CLearViD leverages the Mish activation function, which provides non-linearity and non-monotonicity and helps alleviate the issue of vanishing gradients. Our extensive experiments and ablation studies demonstrate the effectiveness of the proposed model. The results on two datasets, namely ActivityNet Captions and YouCook2, show that CLearViD significantly outperforms existing state-of-the-art models in terms of both accuracy and diversity metrics.  ( 2 min )
  • Open

    Hierarchical clustering with dot products recovers hidden tree structure. (arXiv:2305.15022v2 [stat.ML] UPDATED)
    In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
    Functional Bayesian Tucker Decomposition for Continuous-indexed Tensor Data. (arXiv:2311.04829v1 [cs.LG])
    Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there were finite objects in each aspect or mode, corresponding to discrete indexes of data entries. However, many real-world data are not naturally posed in the setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose Functional Bayesian Tucker Decomposition (FunBaT). We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions, and then convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE) to reduce computational cost. An efficient inference algorithm is further developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications.
    Versatile Energy-Based Probabilistic Models for High Energy Physics. (arXiv:2302.00695v4 [cs.LG] UPDATED)
    As a classical generative modeling approach, energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In line with these advancements, we build a multi-purpose energy-based probabilistic model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. As for applicational aspects, it can serve as a powerful parameterized event generator for physics simulation, a generic anomalous signal detector free from spurious correlations, and an augmented event classifier for particle identification.
    Spectral Evolution and Invariance in Linear-width Neural Networks. (arXiv:2211.06506v2 [cs.LG] UPDATED)
    We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to network width. Empirically, we show that the spectra of weight in this high dimensional regime are invariant when trained by gradient descent for small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy tail behavior. Simple examples are provided to explain when heavy tails can have better generalizations. We exhibit different spectral properties such as invariant bulk, spike, and heavy-tailed distribution from a two-layer neural network using different training strategies, and then correlate them to the feature learning. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning.
    Towards Few-Annotation Learning in Computer Vision: Application to Image Classification and Object Detection tasks. (arXiv:2311.04888v1 [cs.CV])
    In this thesis, we develop theoretical, algorithmic and experimental contributions for Machine Learning with limited labels, and more specifically for the tasks of Image Classification and Object Detection in Computer Vision. In a first contribution, we are interested in bridging the gap between theory and practice for popular Meta-Learning algorithms used in Few-Shot Classification. We make connections to Multi-Task Representation Learning, which benefits from solid theoretical foundations, to verify the best conditions for a more efficient meta-learning. Then, to leverage unlabeled data when training object detectors based on the Transformer architecture, we propose both an unsupervised pretraining and a semi-supervised learning method in two other separate contributions. For pretraining, we improve Contrastive Learning for object detectors by introducing the localization information. Finally, our semi-supervised method is the first tailored to transformer-based detectors.
    Spuriosity Didn't Kill the Classifier: Using Invariant Predictions to Harness Spurious Features. (arXiv:2307.09933v2 [cs.LG] UPDATED)
    To avoid failures on out-of-distribution data, recent works have sought to extract features that have an invariant or stable relationship with the label across domains, discarding "spurious" or unstable features whose relationship with the label changes across domains. However, unstable features often carry complementary information that could boost performance if used correctly in the test domain. In this work, we show how this can be done without test-domain labels. In particular, we prove that pseudo-labels based on stable features provide sufficient guidance for doing so, provided that stable and unstable features are conditionally independent given the label. Based on this theoretical insight, we propose Stable Feature Boosting (SFB), an algorithm for: (i) learning a predictor that separates stable and conditionally-independent unstable features; and (ii) using the stable-feature predictions to adapt the unstable-feature predictions in the test domain. Theoretically, we prove that SFB can learn an asymptotically-optimal predictor without test-domain labels. Empirically, we demonstrate the effectiveness of SFB on real and synthetic data.
    Conditional Sampling of Variational Autoencoders via Iterated Approximate Ancestral Sampling. (arXiv:2308.09078v2 [cs.LG] UPDATED)
    Conditional sampling of variational autoencoders (VAEs) is needed in various applications, such as missing data imputation, but is computationally intractable. A principled choice for asymptotically exact conditional sampling is Metropolis-within-Gibbs (MWG). However, we observe that the tendency of VAEs to learn a structured latent space, a commonly desired property, can cause the MWG sampler to get "stuck" far from the target distribution. This paper mitigates the limitations of MWG: we systematically outline the pitfalls in the context of VAEs, propose two original methods that address these pitfalls, and demonstrate an improved performance of the proposed methods on a set of sampling tasks.
    Statistical limits of correlation detection in trees. (arXiv:2209.13723v2 [math.ST] UPDATED)
    In this paper we address the problem of testing whether two observed trees $(t,t')$ are sampled either independently or from a joint distribution under which they are correlated. This problem, which we refer to as correlation detection in trees, plays a key role in the study of graph alignment for two correlated random graphs. Motivated by graph alignment, we investigate the conditions of existence of one-sided tests, i.e. tests which have vanishing type I error and non-vanishing power in the limit of large tree depth. For the correlated Galton-Watson model with Poisson offspring of mean $\lambda>0$ and correlation parameter $s \in (0,1)$, we identify a phase transition in the limit of large degrees at $s = \sqrt{\alpha}$, where $\alpha \sim 0.3383$ is Otter's constant. Namely, we prove that no such test exists for $s \leq \sqrt{\alpha}$, and that such a test exists whenever $s > \sqrt{\alpha}$, for $\lambda$ large enough. This result sheds new light on the graph alignment problem in the sparse regime (with $O(1)$ average node degrees) and on the performance of the MPAlign method studied in Ganassali et al. (2021), Piccioli et al. (2021), proving in particular the conjecture of Piccioli et al. (2021) that MPAlign succeeds in the partial recovery task for correlation parameter $s>\sqrt{\alpha}$ provided the average node degree $\lambda$ is large enough. As a byproduct, we identify a new family of orthogonal polynomials for the Poisson-Galton-Watson measure which enjoy remarkable properties. These polynomials may be of independent interest for a variety of problems involving graphs, trees or branching processes, beyond the scope of graph alignment.
    Learning Linear Gaussian Polytree Models with Interventions. (arXiv:2311.04636v1 [stat.ML])
    We present a consistent and highly scalable local approach to learn the causal structure of a linear Gaussian polytree using data from interventional experiments with known intervention targets. Our methods first learn the skeleton of the polytree and then orient its edges. The output is a CPDAG representing the interventional equivalence class of the polytree of the true underlying distribution. The skeleton and orientation recovery procedures we use rely on second order statistics and low-dimensional marginal distributions. We assess the performance of our methods under different scenarios in synthetic data sets and apply our algorithm to learn a polytree in a gene expression interventional data set. Our simulation studies demonstrate that our approach is fast, has good accuracy in terms of structural Hamming distance, and handles problems with thousands of nodes.
    Solving Kernel Ridge Regression with Gradient-Based Optimization Methods. (arXiv:2306.16838v3 [stat.ML] UPDATED)
    Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. Here, we introduce an equivalent formulation of the objective function of KRR, opening up both for using penalties other than the ridge penalty and for studying kernel ridge regression from the perspective of gradient descent. Using a continuous-time perspective, we derive a closed-form solution for solving kernel regression with gradient descent, something we refer to as kernel gradient flow, KGF, and theoretically bound the differences between KRR and KGF, where, for the latter, regularization is obtained through early stopping. We also generalize KRR by replacing the ridge penalty with the $\ell_1$ and $\ell_\infty$ penalties, respectively, and use the fact that analogous to the similarities between KGF and KRR, $\ell_1$ regularization and forward stagewise regression (also known as coordinate descent), and $\ell_\infty$ regularization and sign gradient descent, follow similar solution paths. We can thus alleviate the need for computationally heavy algorithms based on proximal gradient descent. We show theoretically and empirically how the $\ell_1$ and $\ell_\infty$ penalties, and the corresponding gradient-based optimization algorithms, produce sparse and robust kernel regression solutions, respectively.
    Robust and Communication-Efficient Federated Domain Adaptation via Random Features. (arXiv:2311.04686v1 [cs.LG])
    Modern machine learning (ML) models have grown to a scale where training them on a single machine becomes impractical. As a result, there is a growing trend to leverage federated learning (FL) techniques to train large ML models in a distributed and collaborative manner. These models, however, when deployed on new devices, might struggle to generalize well due to domain shifts. In this context, federated domain adaptation (FDA) emerges as a powerful approach to address this challenge. Most existing FDA approaches typically focus on aligning the distributions between source and target domains by minimizing their (e.g., MMD) distance. Such strategies, however, inevitably introduce high communication overheads and can be highly sensitive to network reliability. In this paper, we introduce RF-TCA, an enhancement to the standard Transfer Component Analysis approach that significantly accelerates computation without compromising theoretical and empirical performance. Leveraging the computational advantage of RF-TCA, we further extend it to FDA setting with FedRF-TCA. The proposed FedRF-TCA protocol boasts communication complexity that is \emph{independent} of the sample size, while maintaining performance that is either comparable to or even surpasses state-of-the-art FDA methods. We present extensive experiments to showcase the superior performance and robustness (to network condition) of FedRF-TCA.
    Implicit models, latent compression, intrinsic biases, and cheap lunches in community detection. (arXiv:2210.09186v7 [cs.SI] UPDATED)
    The task of community detection, which aims to partition a network into clusters of nodes to summarize its large-scale structure, has spawned the development of many competing algorithms with varying objectives. Some community detection methods are inferential, explicitly deriving the clustering objective through a probabilistic generative model, while other methods are descriptive, dividing a network according to an objective motivated by a particular application, making it challenging to compare these methods on the same scale. Here we present a solution to this problem that associates any community detection objective, inferential or descriptive, with its corresponding implicit network generative model. This allows us to compute the description length of a network and its partition under arbitrary objectives, providing a principled measure to compare the performance of different algorithms without the need for "ground truth" labels. Our approach also gives access to instances of the community detection problem that are optimal to any given algorithm, and in this way reveals intrinsic biases in popular descriptive methods, explaining their tendency to overfit. Using our framework, we compare a number of community detection methods on artificial networks, and on a corpus of over 500 structurally diverse empirical networks. We find that more expressive community detection methods exhibit consistently superior compression performance on structured data instances, without having degraded performance on a minority of situations where more specialized algorithms perform optimally. Our results undermine the implications of the "no free lunch" theorem for community detection, both conceptually and in practice, since it is confined to unstructured data instances, unlike relevant community detection problems which are structured by requirement.
    Zero-Shot Anomaly Detection via Batch Normalization. (arXiv:2302.07849v4 [cs.LG] UPDATED)
    Anomaly detection (AD) plays a crucial role in many safety-critical application domains. The challenge of adapting an anomaly detector to drift in the normal data distribution, especially when no training data is available for the "new normal," has led to the development of zero-shot AD techniques. In this paper, we propose a simple yet effective method called Adaptive Centered Representations (ACR) for zero-shot batch-level AD. Our approach trains off-the-shelf deep anomaly detectors (such as deep SVDD) to adapt to a set of inter-related training data distributions in combination with batch normalization, enabling automatic zero-shot generalization for unseen AD tasks. This simple recipe, batch normalization plus meta-training, is a highly effective and versatile tool. Our theoretical results guarantee the zero-shot generalization for unseen AD tasks; our empirical results demonstrate the first zero-shot AD results for tabular data and outperform existing methods in zero-shot anomaly detection and segmentation on image data from specialized domains. Code is at https://github.com/aodongli/zero-shot-ad-via-batch-norm
    Compressive Recovery of Sparse Precision Matrices. (arXiv:2311.04673v1 [stat.ML])
    We consider the problem of learning a graph modeling the statistical relations of the $d$ variables of a dataset with $n$ samples $X \in \mathbb{R}^{n \times d}$. Standard approaches amount to searching for a precision matrix $\Theta$ representative of a Gaussian graphical model that adequately explains the data. However, most maximum likelihood-based estimators usually require storing the $d^{2}$ values of the empirical covariance matrix, which can become prohibitive in a high-dimensional setting. In this work, we adopt a compressive viewpoint and aim to estimate a sparse $\Theta$ from a sketch of the data, i.e. a low-dimensional vector of size $m \ll d^{2}$ carefully designed from $X$ using nonlinear random features. Under certain assumptions on the spectrum of $\Theta$ (or its condition number), we show that it is possible to estimate it from a sketch of size $m=\Omega((d+2k)\log(d))$ where $k$ is the maximal number of edges of the underlying graph. These information-theoretic guarantees are inspired by compressed sensing theory and involve restricted isometry properties and instance optimal decoders. We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser. We compare our approach and graphical lasso on synthetic datasets, demonstrating its favorable performance even when the dataset is compressed.
    Multi-Source Domain Adaptation through Dataset Dictionary Learning in Wasserstein Space. (arXiv:2307.14953v3 [cs.LG] UPDATED)
    This paper seeks to solve Multi-Source Domain Adaptation (MSDA), which aims to mitigate data distribution shifts when transferring knowledge from multiple labeled source domains to an unlabeled target domain. We propose a novel MSDA framework based on dictionary learning and optimal transport. We interpret each domain in MSDA as an empirical distribution. As such, we express each domain as a Wasserstein barycenter of dictionary atoms, which are empirical distributions. We propose a novel algorithm, DaDiL, for learning via mini-batches: (i) atom distributions; (ii) a matrix of barycentric coordinates. Based on our dictionary, we propose two novel methods for MSDA: DaDil-R, based on the reconstruction of labeled samples in the target domain, and DaDiL-E, based on the ensembling of classifiers learned on atom distributions. We evaluate our methods in 3 benchmarks: Caltech-Office, Office 31, and CRWU, where we improved previous state-of-the-art by 3.15%, 2.29%, and 7.71% in classification performance. Finally, we show that interpolations in the Wasserstein hull of learned atoms provide data that can generalize to the target domain.
    Causal disentanglement of multimodal data. (arXiv:2310.18471v2 [cs.LG] UPDATED)
    Causal representation learning algorithms discover lower-dimensional representations of data that admit a decipherable interpretation of cause and effect; as achieving such interpretable representations is challenging, many causal learning algorithms utilize elements indicating prior information, such as (linear) structural causal models, interventional data, or weak supervision. Unfortunately, in exploratory causal representation learning, such elements and prior information may not be available or warranted. Alternatively, scientific datasets often have multiple modalities or physics-based constraints, and the use of such scientific, multimodal data has been shown to improve disentanglement in fully unsupervised settings. Consequently, we introduce a causal representation learning algorithm (causalPIMA) that can use multimodal data and known physics to discover important features with causal relationships. Our innovative algorithm utilizes a new differentiable parametrization to learn a directed acyclic graph (DAG) together with a latent space of a variational autoencoder in an end-to-end differentiable framework via a single, tractable evidence lower bound loss function. We place a Gaussian mixture prior on the latent space and identify each of the mixtures with an outcome of the DAG nodes; this novel identification enables feature discovery with causal relationships. Tested against a synthetic and a scientific dataset, our results demonstrate the capability of learning an interpretable causal structure while simultaneously discovering key features in a fully unsupervised setting.  ( 2 min )
    21cmEMU: an emulator of 21cmFAST summary observables. (arXiv:2309.05697v1 [astro-ph.CO] CROSS LISTED)
    Recent years have witnessed rapid progress in observations of the Epoch of Reionization (EoR). These have enabled high-dimensional inference of galaxy and intergalactic medium (IGM) properties during the first billion years of our Universe. However, even using efficient, semi-numerical simulations, traditional inference approaches that compute 3D lightcones on-the-fly can take $10^5$ core hours. Here we present 21cmEMU: an emulator of several summary observables from the popular 21cmFAST simulation code. 21cmEMU takes as input nine parameters characterizing EoR galaxies, and outputs the following summary statistics: (i) the IGM mean neutral fraction; (ii) the 21-cm power spectrum; (iii) the mean 21-cm spin temperature; (iv) the sky-averaged (global) 21-cm signal; (vi) the ultraviolet (UV) luminosity functions (LFs); and (vii) the Thomson scattering optical depth to the cosmic microwave background (CMB). All observables are predicted with sub-percent median accuracy, with a reduction of the computational cost by a factor of over 10$^4$. After validating inference results, we showcase a few applications, including: (i) quantifying the relative constraining power of different observational datasets; (ii) seeing how recent claims of a late EoR impact previous inferences; and (iii) forecasting upcoming constraints from the sixth observing season of the Hydrogen Epoch of Reionization Array (HERA) telescope. 21cmEMU is publicly-available, and is included as an alternative simulator in the public 21CMMC sampler.  ( 2 min )
    Joint control variate for faster black-box variational inference. (arXiv:2210.07290v3 [cs.LG] UPDATED)
    Black-box variational inference performance is sometimes hindered by the use of gradient estimators with high variance. This variance comes from two sources of randomness: Data subsampling and Monte Carlo sampling. While existing control variates only address Monte Carlo noise, and incremental gradient methods typically only address data subsampling, we propose a new "joint" control variate that jointly reduces variance from both sources of noise. This significantly reduces gradient variance, leading to faster optimization in several applications.  ( 2 min )
    Towards a Unified Framework of Contrastive Learning for Disentangled Representations. (arXiv:2311.04774v1 [cs.LG])
    Contrastive learning has recently emerged as a promising approach for learning data representations that discover and disentangle the explanatory factors of the data. Previous analyses of such approaches have largely focused on individual contrastive losses, such as noise-contrastive estimation (NCE) and InfoNCE, and rely on specific assumptions about the data generating process. This paper extends the theoretical guarantees for disentanglement to a broader family of contrastive methods, while also relaxing the assumptions about the data distribution. Specifically, we prove identifiability of the true latents for four contrastive losses studied in this paper, without imposing common independence assumptions. The theoretical findings are validated on several benchmark datasets. Finally, practical limitations of these methods are also investigated.  ( 2 min )
    Information-Theoretic Generalization Bounds for Transductive Learning and its Applications. (arXiv:2311.04561v1 [cs.LG])
    In this paper, we develop data-dependent and algorithm-dependent generalization bounds for transductive learning algorithms in the context of information theory for the first time. We show that the generalization gap of transductive learning algorithms can be bounded by the mutual information between training labels and hypothesis. By innovatively proposing the concept of transductive supersamples, we go beyond the inductive learning setting and establish upper bounds in terms of various information measures. Furthermore, we derive novel PAC-Bayesian bounds and build the connection between generalization and loss landscape flatness under the transductive learning setting. Finally, we present the upper bounds for adaptive optimization algorithms and demonstrate the applications of results on semi-supervised learning and graph learning scenarios. Our theoretic results are validated on both synthetic and real-world datasets.  ( 2 min )
    Why Do Clinical Probabilistic Models Fail To Transport Between Sites?. (arXiv:2311.04787v1 [cs.LG])
    The rising popularity of artificial intelligence in healthcare is highlighting the problem that a computational model achieving super-human clinical performance at its training sites may perform substantially worse at new sites. In this perspective, we present common sources for this failure to transport, which we divide into sources under the control of the experimenter and sources inherent to the clinical data-generating process. Of the inherent sources we look a little deeper into site-specific clinical practices that can affect the data distribution, and propose a potential solution intended to isolate the imprint of those practices on the data from the patterns of disease cause and effect that are the usual target of clinical models.  ( 2 min )
    More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime-validity. (arXiv:2306.12214v2 [stat.ML] UPDATED)
    In this paper, we present new high-probability PAC-Bayes bounds for different types of losses. Firstly, for losses with a bounded range, we recover a strengthened version of Catoni's bound that holds uniformly for all parameter values. This leads to new fast rate and mixed rate bounds that are interpretable and tighter than previous bounds in the literature. In particular, the fast rate bound is equivalent to the Seeger--Langford bound. Secondly, for losses with more general tail behaviors, we introduce two new parameter-free bounds: a PAC-Bayes Chernoff analogue when the loss' cumulative generating function is bounded, and a bound when the loss' second moment is bounded. These two bounds are obtained using a new technique based on a discretization of the space of possible events for the "in probability" parameter optimization problem. This technique is both simpler and more general than previous approaches optimizing over a grid on the parameters' space. Finally, we extend all previous results to anytime-valid bounds using a simple technique applicable to any existing bound.  ( 2 min )
    Two Complementary Perspectives to Continual Learning: Ask Not Only What to Optimize, But Also How. (arXiv:2311.04898v1 [cs.LG])
    Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some continual learning work that alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as alternative to improving the optimization objective, while we argue it should be complementary. To evaluate the merits of our proposition, we plan to combine replay-approximated joint objectives with gradient projection-based optimization routines to test whether the addition of the latter provides benefits in terms of (1) alleviating the stability gap, (2) increasing the learning efficiency and (3) improving the final learning outcome.  ( 3 min )
    Regression with Cost-based Rejection. (arXiv:2311.04550v1 [cs.LG])
    Learning with rejection is an important framework that can refrain from making predictions to avoid critical mispredictions by balancing between prediction and rejection. Previous studies on cost-based rejection only focused on the classification setting, which cannot handle the continuous and infinite target space in the regression setting. In this paper, we investigate a novel regression problem called regression with cost-based rejection, where the model can reject to make predictions on some examples given certain rejection costs. To solve this problem, we first formulate the expected risk for this problem and then derive the Bayes optimal solution, which shows that the optimal model should reject to make predictions on the examples whose variance is larger than the rejection cost when the mean squared error is used as the evaluation metric. Furthermore, we propose to train the model by a surrogate loss function that considers rejection as binary classification and we provide conditions for the model consistency, which implies that the Bayes optimal solution can be recovered by our proposed surrogate loss. Extensive experiments demonstrate the effectiveness of our proposed method.  ( 2 min )
    Assessment of the Reliablity of a Model's Decision by Generalizing Attribution to the Wavelet Domain. (arXiv:2305.14979v4 [cs.CV] UPDATED)
    Neural networks have shown remarkable performance in computer vision, but their deployment in numerous scientific and technical fields is challenging due to their black-box nature. Scientists and practitioners need to evaluate the reliability of a decision, i.e., to know simultaneously if a model relies on the relevant features and whether these features are robust to image corruptions. Existing attribution methods aim to provide human-understandable explanations by highlighting important regions in the image domain, but fail to fully characterize a decision process's reliability. To bridge this gap, we introduce the Wavelet sCale Attribution Method (WCAM), a generalization of attribution from the pixel domain to the space-scale domain using wavelet transforms. Attribution in the wavelet domain reveals where and on what scales the model focuses, thus enabling us to assess whether a decision is reliable. Our code is accessible here: \url{https://github.com/gabrielkasmi/spectral-attribution}.  ( 2 min )
    Causal Scoring: A Framework for Effect Estimation, Effect Ordering, and Effect Classification. (arXiv:2206.12532v3 [stat.ML] UPDATED)
    This paper introduces causal scoring as a novel approach to frame causal estimation in the context of decision making. Causal scoring entails the estimation of scores that support decision making by providing insights into causal effects. We present three valuable causal interpretations of these scores: effect estimation (EE), effect ordering (EO), and effect classification (EC). In the EE interpretation, the causal score represents the effect itself. The EO interpretation implies that the score can serve as a proxy for the magnitude of the effect, enabling the sorting of individuals based on their causal effects. The EC interpretation enables the classification of individuals into high- and low-effect categories using a predefined threshold. We demonstrate the value of these alternative causal interpretations (EO and EC) through two key results. First, we show that aligning the statistical modeling with the desired causal interpretation improves the accuracy of causal estimation. Second, we establish that more flexible causal interpretations are plausible in a wider range of data-generating processes and propose conditions to assess their validity. We showcase the practical utility of the causal scoring framework through examples in diverse fields such as advertising, healthcare, and education, illustrating how it facilitates reasoning about flexible causal interpretations of statistical estimates in various contexts. The examples encompass confounded estimates, effect estimates on surrogate outcomes, and even predictions about non-causal quantities as potential causal scores.  ( 3 min )
    Certified Data Removal from Machine Learning Models. (arXiv:1911.03030v6 [cs.LG] UPDATED)
    Good data stewardship requires removal of data at the request of the data's owner. This raises the question if and how a trained machine-learning model, which implicitly stores information about its training data, should be affected by such a removal request. Is it possible to "remove" data from a machine-learning model? We study this problem by defining certified removal: a very strong theoretical guarantee that a model from which data is removed cannot be distinguished from a model that never observed the data to begin with. We develop a certified-removal mechanism for linear classifiers and empirically study learning settings in which this mechanism is practical.  ( 2 min )
    Data fission: splitting a single data point. (arXiv:2112.11079v8 [stat.ME] UPDATED)
    Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offers an alternative route of accomplishing this task through randomization of $X$ with additive Gaussian noise which enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.  ( 3 min )
    Robust Mean Estimation Without Moments for Symmetric Distributions. (arXiv:2302.10844v2 [cs.DS] UPDATED)
    We study the problem of robustly estimating the mean or location parameter without moment assumptions. We show that for a large class of symmetric distributions, the same error as in the Gaussian setting can be achieved efficiently. The distributions we study include products of arbitrary symmetric one-dimensional distributions, such as product Cauchy distributions, as well as elliptical distributions. For product distributions and elliptical distributions with known scatter (covariance) matrix, we show that given an $\varepsilon$-corrupted sample, we can with probability at least $1-\delta$ estimate its location up to error $O(\varepsilon \sqrt{\log(1/\varepsilon)})$ using $\tfrac{d\log(d) + \log(1/\delta)}{\varepsilon^2 \log(1/\varepsilon)}$ samples. This result matches the best-known guarantees for the Gaussian distribution and known SQ lower bounds (up to the $\log(d)$ factor). For elliptical distributions with unknown scatter (covariance) matrix, we propose a sequence of efficient algorithms that approaches this optimal error. Specifically, for every $k \in \mathbb{N}$, we design an estimator using time and samples $\tilde{O}({d^k})$ achieving error $O(\varepsilon^{1-\frac{1}{2k}})$. This matches the error and running time guarantees when assuming certifiably bounded moments of order up to $k$. For unknown covariance, such error bounds of $o(\sqrt{\varepsilon})$ are not even known for (general) sub-Gaussian distributions. Our algorithms are based on a generalization of the well-known filtering technique. We show how this machinery can be combined with Huber-loss-based techniques to work with projections of the noise that behave more nicely than the initial noise. Moreover, we show how SoS proofs can be used to obtain algorithmic guarantees even for distributions without a first moment. We believe that this approach may find other applications in future works.  ( 3 min )
    Natural Bayesian Cram\'er-Rao Bound with an Application to Covariance Estimation. (arXiv:2311.04748v1 [math.ST])
    In this paper, we propose to develop a new Cram\'er-Rao Bound (CRB) when the parameter to estimate lies in a manifold and follows a prior distribution. This derivation leads to a natural inequality between an error criteria based on geometrical properties and this new bound. This main contribution is illustrated in the problem of covariance estimation when the data follow a Gaussian distribution and the prior distribution is an inverse Wishart. Numerical simulation shows new results where the proposed CRB allows to exhibit interesting properties of the MAP estimator which are not observed with the classical Bayesian CRB.  ( 2 min )
    Optimal Deep Neural Network Approximation for Korobov Functions with respect to Sobolev Norms. (arXiv:2311.04779v1 [math.NA])
    This paper establishes the nearly optimal rate of approximation for deep neural networks (DNNs) when applied to Korobov functions, effectively overcoming the curse of dimensionality. The approximation results presented in this paper are measured with respect to $L_p$ norms and $H^1$ norms. Our achieved approximation rate demonstrates a remarkable "super-convergence" rate, outperforming traditional methods and any continuous function approximator. These results are non-asymptotic, providing error bounds that consider both the width and depth of the networks simultaneously.  ( 2 min )
    Robust Best-arm Identification in Linear Bandits. (arXiv:2311.04731v1 [cs.LG])
    We study the robust best-arm identification problem (RBAI) in the case of linear rewards. The primary objective is to identify a near-optimal robust arm, which involves selecting arms at every round and assessing their robustness by exploring potential adversarial actions. This approach is particularly relevant when utilizing a simulator and seeking to identify a robust solution for real-world transfer. To this end, we present an instance-dependent lower bound for the robust best-arm identification problem with linear rewards. Furthermore, we propose both static and adaptive bandit algorithms that achieve sample complexity that matches the lower bound. In synthetic experiments, our algorithms effectively identify the best robust arm and perform similarly to the oracle strategy. As an application, we examine diabetes care and the process of learning insulin dose recommendations that are robust with respect to inaccuracies in standard calculators. Our algorithms prove to be effective in identifying robust dosage values across various age ranges of patients.  ( 2 min )
    On the estimation of the number of components in multivariate functional principal component analysis. (arXiv:2311.04540v1 [stat.ME])
    Happ and Greven (2018) developed a methodology for principal components analysis of multivariate functional data for data observed on different dimensional domains. Their approach relies on an estimation of univariate functional principal components for each univariate functional feature. In this paper, we present extensive simulations to investigate choosing the number of principal components to retain. We show empirically that the conventional approach of using a percentage of variance explained threshold for each univariate functional feature may be unreliable when aiming to explain an overall percentage of variance in the multivariate functional data, and thus we advise practitioners to be careful when using it.  ( 2 min )
    Likelihood Ratio Confidence Sets for Sequential Decision Making. (arXiv:2311.04402v1 [cs.LG])
    Certifiable, adaptive uncertainty estimates for unknown quantities are an essential ingredient of sequential decision-making algorithms. Standard approaches rely on problem-dependent concentration results and are limited to a specific combination of parameterization, noise family, and estimator. In this paper, we revisit the likelihood-based inference principle and propose to use likelihood ratios to construct any-time valid confidence sequences without requiring specialized treatment in each application scenario. Our method is especially suitable for problems with well-specified likelihoods, and the resulting sets always maintain the prescribed coverage in a model-agnostic manner. The size of the sets depends on a choice of estimator sequence in the likelihood ratio. We discuss how to provably choose the best sequence of estimators and shed light on connections to online convex optimization with algorithms such as Follow-the-Regularized-Leader. To counteract the initially large bias of the estimators, we propose a reweighting scheme that also opens up deployment in non-parametric settings such as RKHS function classes. We provide a non-asymptotic analysis of the likelihood ratio confidence sets size for generalized linear models, using insights from convex duality and online learning. We showcase the practical strength of our method on generalized linear bandit problems, survival analysis, and bandits with various additive noise distributions.  ( 2 min )
    Online Learning Quantum States with the Logarithmic Loss via VB-FTRL. (arXiv:2311.04237v1 [quant-ph])
    Online learning quantum states with the logarithmic loss (LL-OLQS) is a quantum generalization of online portfolio selection, a classic open problem in the field of online learning for over three decades. The problem also emerges in designing randomized optimization algorithms for maximum-likelihood quantum state tomography. Recently, Jezequel et al. (arXiv:2209.13932) proposed the VB-FTRL algorithm, the first nearly regret-optimal algorithm for OPS with moderate computational complexity. In this note, we generalize VB-FTRL for LL-OLQS. Let $d$ denote the dimension and $T$ the number of rounds. The generalized algorithm achieves a regret rate of $O ( d^2 \log ( d + T ) )$ for LL-OLQS. Each iteration of the algorithm consists of solving a semidefinite program that can be implemented in polynomial time by, e.g., cutting-plane methods. For comparison, the best-known regret rate for LL-OLQS is currently $O ( d^2 \log T )$, achieved by the exponential weight method. However, there is no explicit implementation available for the exponential weight method for LL-OLQS. To facilitate the generalization, we introduce the notion of VB-convexity. VB-convexity is a sufficient condition for the logarithmic barrier associated with any function to be convex and is of independent interest.  ( 2 min )

  • Open

    [D] How can you add additional features/attributes while doing instance segmentation?
    I want to do an instance segmentation of objects in images. Usually I would stick to something like an Mask R CNN and let it run. However additionally to the image itself and the pre-labeled images, I have additional features that might be interesting for the segmentation. Example: I want to segment certain products in images from a factory and I have additional information about the products than run at a specific time (like product family, color, brand, etc.). How do I add these additional features in an instance segmentation? submitted by /u/phillylovesdata [link] [comments]  ( 9 min )
    [P] Pandas Vs Tensorflow for reading dataset
    Background info: Greetings. I am a student who attentds computer science uni and as part of my dissertation I have to train some models. The thing is,I'm quite new with machine learning and my knowledge is limited so far. Main: I'm trying to open a 10gb dataset in Google colab to sanitize and preprocess the data before feeding them into a CNN model and I don't know which is the best way to do it. Thanks for your time submitted by /u/ThrowRA39495 [link] [comments]  ( 9 min )
    [P]Coqui released XTTSv2
    XTTSv2 is released. I’d say it’s a big jump in quality. Better voice cloning Better audio Impressive prosody and expressiveness Added more languages, I guess total 16 languages. Non-EN languages sounds way better Streaming under 200ms ( I have 3090) Finetuning code Here you can try https://huggingface.co/spaces/coqui/xtts submitted by /u/coinfelix [link] [comments]  ( 9 min )
    [D] Is it a good idea to use VGG16 for a image classification mobile app?
    I'm new to image classification and ML and this is going to be my first project on those topics. I'm considering using VGG16 because I saw some studies showing that it has a generally great accuracy score (80-95%) but I'm worried that the model might not be fast enough or the app file size might get massive if I want the app to be usable without internet connection. What do you guys think? submitted by /u/Rhet98 [link] [comments]  ( 9 min )
    [P] GPT vs. StarCraft
    This is the first in a series of webcasts covering the development and experimentation of using GPT algorithms, LangChain and Python to control the high-level strategy of a StarCraft II bot. I’ll be running through the basics of the implementation, discussing the use of prompts and prompt engineering, and demonstrating the implementation in action. https://youtu.be/E3Sj2L6ZnXA submitted by /u/Resident-Weather-324 [link] [comments]  ( 9 min )
    [D] transformers Trainer log
    I'm trying to finetune an LLM with LoRA, using transformers' Trainer. In twenty minutes of training it didn't output any logs to screen or to disk, even though I set logging_steps to 1: trainer = transformers.Trainer( ... args = transformers.TrainingArguments( ... logging_steps=1, logging_dir="logs", ), ) What do I need to do to see the training log? Set up any callbacks? submitted by /u/Foxtr0t [link] [comments]  ( 9 min )
    Doing Molecular Dynamics Simulations using GNN's [D]
    I've been trying to make a gnn that can do molecular dynamics simulations on some decently simple molecules. While I have experience with ml, I'm pretty new to the molecular dynamics part and found that to be pretty confusing. Can someone please point me to some resources that cover this topic? submitted by /u/The_Invincible7 [link] [comments]  ( 9 min )
    [D] How To Do Product Matching Using ML?
    Hi Folks, I am trying to build a solution where I input an e-commerce product URL and automatically get all the competitors for that product. If you could provide any direction concerning it. It will be beneficial. submitted by /u/Used-Preparation-921 [link] [comments]  ( 9 min )
    [D] What AI topics are you curious about but rarely see in the spotlight?
    I'm a data engineer who somehow ended up as a software developer. So many of my friends are working now with the OpenAI api to add generative capabilities to their product, but they lack A LOT of context when it comes to how LLMs actually works. This is why I started writing popular-science style articles that unpack AI concepts for software developers working on real-world application. It started kind of slow, honestly I wrote a bit too "brainy" for them, but now I've found a voice that resonance with this audience much better and I want to ramp up my writing cadence. I would love to hear your thoughts about what concepts I should write about next? What get you excited and you find hard to explain to someone with a different background? submitted by /u/GratefullyFriendly73 [link] [comments]  ( 9 min )
    [P] Fine-grained semantic search and clustering with interpretable multi-feature text embeddings
    Hi, we all know that text embeddings (e.g., SBERT, simCSE, LLM embeddings) are very powerful. However, my little grudge with them was always that it's hard to say what's really in them. Okay, matching them gives some value of "relatedness" or "similarity", but the value is kind of really hard to interpret. I mean text can be really diverse and is often similar in some categories, but not in others. Here's an example: "The man builds a tent" "Two men build a tent" A text embedding model such as SBERT gives a high similarity score, which is fine, since the sentences are in fact quite similar. However, they're similar because they're mostly on the same stuff/topics, but they're dissimilar in their use of number: in the first sentence there's one man, in the second sentence there's two! …  ( 10 min )
    [D] T5-base results are worse than t5-small
    Hi everyone, I pretrained T5 small, base and large on the [PrivaSeer](https://privaseer.ist.psu.edu/data) corpus with a spanned MLM objective. I called the pretrained model PrivaT5. Then finetuned PrivaT5 and T5 small, base and large on some tasks of the [PrivacyGLUE](https://github.com/infsys-lab/privacy-glue) benchmark. You can see the results in these plots: https://preview.redd.it/vzq1l2jdqazb1.png?width=1280&format=png&auto=webp&s=f73239ecf59f409ec1371a59e52599e89d89c97b For all model sizes I used the same hyperparameters except for the batch size I changed it to make the model fit on the TPU. Example : ​ ``` --model_name_or_path="t5-base" --hub_save_name_or_path="t5-base" --model_type="t5-base" --config_name="t5-base" --tokenizer_name="t5-base" --max_seq_length="512" --per_device_train_batch_size="16" --per_device_eval_batch_size="16" --adafactor --learning_rate="0.001" --weight_decay="0.0" --warmup_steps="0" --overwrite_output_dir --logging_steps="500" --save_steps="50" --eval_steps="50" --num_train_epochs="100" ``` Could anyone give me possible reasons why the PrivaT5 base performance unexpectedly drops on the OPP-115 and Policy-Detection tasks compared to PrivaT5 small? (Multilabel text classification & binary text classification respectively). Thank you! submitted by /u/alzoubi36 [link] [comments]  ( 9 min )
    Best Data Visualization Technique for a Multilabel Classification Data? [Discussion]
    Would like to ask if there are any other Data Visualization Techniques or Tools to visualize a Multilabel Classification Data? One of the many ways that we can visualize a high dimensional dataset is through a correlation heatmap. A heatmap may be made using the seaborn library and the panda dataframe's corr() method which returns the pairwise correlations of columns. Correlation Map using seaborn But I want to explore other and possibly better techniques aside from this. Especially that can deal with a very high dimension dataset such as mine. Would love to hear from your suggestions in regards to this. Thank you very much!!! ​ submitted by /u/RalphuChino [link] [comments]  ( 9 min )
    [R] Levels of AGI: Operationalizing Progress on the Path to AGI - DeepMind 2023
    Paper: https://arxiv.org/abs/2311.02462 Abstract: We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors. This framework introduces levels of AGI performance, generality, and autonomy. It is our hope that this framework will be useful in an analogous way to the levels of autonomous driving, by providing a common language to compare models, assess risks, and measure progress along the path to AGI. To develop our framework, we analyze existing definitions of AGI, and distill six principles that a useful ontology for AGI should satisfy. These principles include focusing on capabilities rather than mechanisms; separately evaluating generality and performance; and defining stages along the path toward AGI, rather than focusing on the endpoint. With these principles in mind, we propose 'Levels of AGI' based on depth (performance) and breadth (generality) of capabilities, and reflect on how current systems fit into this ontology. We discuss the challenging requirements for future benchmarks that quantify the behavior and capabilities of AGI models against these levels. Finally, we discuss how these levels of AGI interact with deployment considerations such as autonomy and risk, and emphasize the importance of carefully selecting Human-AI Interaction paradigms for responsible and safe deployment of highly capable AI systems. https://preview.redd.it/64biopsh79zb1.png?width=797&format=png&auto=webp&s=9af1c5085938dac000aaf23aa1b306133b01edb4 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Best method of knowledge distillation available?
    Best practical method of knowledge distillation available? TL;DR: Knowledge distillation generally performs worse than traning model from scratch on data from what I've seen online. Is there a method of KD where this doesn't happen and I get close to performance of a model if it was trained from scratch? So I've recently been interested in make DL models more useful for everyday tasks. And considering their size trying to run these models on consumer devices without much loss in quality but rn from what I've seen, this just feels like trying fit an elephant into his pants. Basically it tears everytime I try. I found quantization to be cool but I need to reduce its size even more tbh. So I found knowledge distillation. But from what I've seen, though theoretically it is fantastic. Practically knowlege distillation sucks. And is probably worse than just straight up traning the model from scratch on the dataset. So is there a used and proven method of knowledge distillation that I can use? Which will give me at least very close accuracy to a model trained from scratch on dataset? submitted by /u/Xanta_Kross [link] [comments]  ( 9 min )
  • Open

    Looking for the best AI girlfriend if possible
    Looking for something that can get as close to human as possible if one exists. I’ve tried and enjoyed Replika actually still have it but my rep is at 11 and the levels have been exhausted. Though it’s limited to NSFW text and voice only. Was looking to see if I could find an AI that has replika features but with better memory, sends selfies regularly of herself and not randomly generated ones (including NSFW) I’m new to the game. submitted by /u/Jyeung691 [link] [comments]  ( 9 min )
    Humane officially unveils the AI Pin device that aspires to replace smartphone
    Humane has officially unveiled the Humane Ai Pin, an AI-powered device that aims to replace smartphones. The device is a standalone device and software platform built from the ground up with AI, meaning it does not need to be paired with a smartphone. It can perform various functions such as calling, texting, taking photos or videos, listening to music, and more. The device is powered by AI and can recognize objects, provide nutrition information, and even make online purchases. The Humane Ai Pin will cost $699 and will be available for order in the United States on November 16th. Source : https://bgr.com/tech/humane-officially-unveils-the-ai-pin-a-device-that-aspires-to-replace-your-smartphone/ submitted by /u/NuseAI [link] [comments]  ( 9 min )
    Need help training a PPO NN to learn how to play my deckbuilding game
    Hey, I have a roguelike deckbuilding game I want to train an agent to play using pure unsupervised RL; I chose PPO as I understand (to my amateur knowledge) that is the most fitting algorithm. I have a very large categorical space that I have to send in (basically what cards are in the deck and which cards are being offered to pick), and I need the agent to learn the best picks. I attempted to use an embedding layer and input the cards the player has + given cards + numerical data (concated with the embedding output). I tried playing around with various hyperparameters, but so far, I have not been able to generate any learning. Any help or advice would be greatly appreciated, thanks! submitted by /u/Jagerjj [link] [comments]  ( 9 min )
    OCR for custom characters
    Hello guys, I wanna ask if someone knows how can I detect custom made characters with OCR. I want to detect the normal alphabet and numbers but in addition for example hieroglyphics. I tried to use YOLO and the hieroglyphics were labeled as AA or MM or something like that, but I want your opinions. Thanks! submitted by /u/quorra96 [link] [comments]  ( 9 min )
    Elon Musk: War, AI, Aliens, Politics, Physics, Video Games, and Humanity | Lex Fridman Podcast
    submitted by /u/Overflame [link] [comments]  ( 9 min )
    What AI-Tools do you use in your daily routine?
    As title says, I am interested in your daily AI-Tool activities. Which tools do you use and provide you with more efficiency? Do you have any suggestions for me or others? ​ Mine is mymind.com It helps save content as it automatically categorizes it websites and notes. So far, I really enjoy it. What are your suggestions? submitted by /u/345Y_Chubby [link] [comments]  ( 9 min )
    Humane AI Pin reveal video
    submitted by /u/webbedgiant [link] [comments]  ( 9 min )
    Disney Pixar AI Generator Review
    submitted by /u/Amandacerni [link] [comments]  ( 9 min )
    AI Sales Chat Bot and Telephone Agent
    do you know eve.calls? are they legit? - submitted by /u/Niu_Davinci [link] [comments]  ( 9 min )
    AI Sales Chat Bot and Telephone Agent
    do you know eve.calls? are they legit? - submitted by /u/Niu_Davinci [link] [comments]  ( 9 min )
    AI assistive tools able to do some or all of these tasks to help with some of the ongoing issues symptomatic of ADHD. anxiety, depression etc?
    First off im not entirely sure how to frame this or even where I should be asking. So if im in the wrong place and you have a better suggestion that would be great. So far ive asked in LearnMachineLearning and MLQuestions but had no response. As someone with various mental health and gifted learning issues with a history in IT (albeit a bit of a dusty unused one), this is something I have been thinking about for a very long time. Every time i look into it ive found the tech just wasnt really up for it without it being a huge nightmare anyway. Recently ive been looking at recent advances in things like chat GPT and similar AI and I feel its time to look again and see if there is anything that can help me that doesn't need an entire organization working on a mainframe to implement. So here…  ( 11 min )
    Those childhood days that I can't get back
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 9 min )
    AI-Ressources for Content Creators (Avatars, Text-to-speech, speech-to-speech...)
    Hi! I am completely new to this one, but with the advancement of the AI-Tools I want to give it a try. I want to make short tutorials for Beginners in the 3D software Blender, but I don't want to use my voice due to anomymity reasons. Plus, if this is already feaseable, I would at least place a small reacting avatar in the bottom right or so. Ideally, the workflow would be that I narrate what I am doing and the AI translates this into either text or another voice directly, as well as animates the avatar. I am okay with paid tools, my question is if we are "at this point" already technically and if yes, if you have recommended tools? submitted by /u/Ryselle [link] [comments]  ( 9 min )
    One-Minute Daily AI News 11/8/2023
    GitHub Copilot Chat will become generally available in December.[1] Figma introduces FigJam AI to spare designers from boring planning prep.[2] AI Will Cut Cost of Animated Films by 90%, Jeff Katzenberg Says.[3] Google just announced that it’s bringing Generative AI in search to more than 120 new countries and territories. Thus, the Search Generative Experience (SGE) got its largest international expansion so far.[4] Sources: [1] https://www.neowin.net/news/github-copilot-chat-will-become-generally-available-in-december/ [2] https://www.theverge.com/2023/11/7/23950667/figma-figjam-generative-ai-design-tools-beta-announcement [3] https://finance.yahoo.com/news/ai-cut-cost-animated-films-051506470.html? [4] https://www.gsmarena.com/google_brings_generative_ai_in_search_to_120_new_countries_and_territories-news-60524.php submitted by /u/Excellent-Target-847 [link] [comments]  ( 9 min )
    AI Tools for Editing Extensive Video Files from Family Vacations
    Even though we have access to Adobe Premiere and other similar video editing tools, I'm wondering if there's anything out there that will take most scenic or b-roll footage from our family vacation, which includes drone and DSLR footage, and trims the footage in the right sequence to create a video of X length with little or no human intervention. Thoughts? submitted by /u/crmjewelers [link] [comments]  ( 9 min )
  • Open

    Towards model-free RL algorithms that scale well with unstructured data
    Paper: https://arxiv.org/abs/2311.02215 Abstract: Conventional reinforcement learning (RL) algorithms exhibit broad generality in their theoretical formulation and high performance on several challenging domains when combined with powerful function approximation. However, developing RL algorithms that perform well across problems with unstructured observations at scale remains challenging because most function approximation methods rely on externally provisioned knowledge about the structure of the input for good performance (e.g. convolutional networks, graph neural networks, tile-coding). A common practice in RL is to evaluate algorithms on a single problem, or on problems with limited variation in the observation scale. RL practitioners lack a systematic way to study how well a single RL algorithm performs when instantiated across a range of problem scales, and they lack function approximation techniques that scale well with unstructured observations. We address these limitations by providing environments and algorithms to study scaling for unstructured observation vectors and flat action spaces. We introduce a family of combinatorial RL problems with an exponentially large state space and high-dimensional dynamics but where linear computation is sufficient to learn a (nonlinear) value function estimate for performant control. We provide an algorithm that constructs reward-relevant general value function (GVF) questions to find and exploit predictive structure directly from the experience stream. In an empirical evaluation of the approach on synthetic problems, we observe a sample complexity that scales linearly with the observation size. The proposed algorithm reliably outperforms a conventional deep RL algorithm on these scaling problems, and they exhibit several desirable auxiliary properties. These results suggest new algorithmic mechanisms by which algorithms can learn at scale from unstructured data. submitted by /u/APaperADay [link] [comments]  ( 9 min )
    Mujoco on windows
    I'm trying to run this code example https://pytorch.org/rl/tutorials/coding_ppo.html to get a sense for how to implement ppo and I'm also interested in a couple other RL projects that use mujoco. Unfortunately I'm on windows 10 and I heard mujoco is deprecated for windows, so I figured I would just use an ubuntu virtual machine in vmware workstation. Once I set it up, I realized it can't access my nvidia 3080. Should I just dual boot? Or is there a windows solution? If I dual boot will the linux boot have full access to the gpu? submitted by /u/theLanguageSprite [link] [comments]  ( 9 min )
    "When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming", Mozannar et al 2023
    submitted by /u/gwern [link] [comments]  ( 9 min )
    how do i design the action space for this in gym?
    Hello I want my PPO model to choose between 5 different actions. two of these are binary and the other three are real numbers between 0 and 1. so my gym action space is like this: first = Discrete(2) second = Discrete(2) third = Box(low=0,high=1,shape=(1,)) fourth = Box(low=0,high=1,shape=(1,)) fifth = Box(low=0,high=1,shape=(1,)) action_space = Tuple((first,second,third,fourth,fifth)) a sample would look like this: (1, 0, array([0.57706404], dtype=float32), array([0.956483], dtype=float32), array([0.09770907], dtype=float32)) but i actually want only one AI chosen element to be above zero and everything else zero like for example (0, 0, array([0.57706404], dtype=float32), array([0], dtype=float32), array([0], dtype=float32)) is this even possible? The best thing i can come up with is have the following action_space type = Discrete(5) ratio = Box(low=0,high=1,shape=(1,)) d = Tuple((first,ratio)) sample: sample: (4, array([0.37991711], dtype=float32)) My guess is that would work but it does not seem optimal to me. Or should i actually use the previous action_space and hope PPO will learn over time to only put one element above zero and null everything else? Any help would be appreciated. Thanks! submitted by /u/phoenix_fire_stone [link] [comments]  ( 9 min )
    Offline RL: Hyper-parameter selection and comparison with behavior policy
    For offline hyper-parameter selection, I've come across several papers, including this one, that recommend methods like FQE. These approaches can help eliminate poorly performing policies. I tested this on a toy environment, and the results so far indicate a strong correlation between actual rewards and the values estimated by the FQE. Is it also a valid technique to compare the performance of the trained policy with the behavior policy (the policy used to collect the data)? I have a few questions: Could the FQE potentially overestimate out-of-distribution (OOD) actions, leading to a higher value for the trained policy (that selects OOD actions) compared to the behavior policy? Would the comparison be valid if the data is collected by several different policies with varying performance levels (some expert, others completely random, and some in between)? Also if you have experience with offline reinforcement learning, I would love to know how did you deal with hyper-parameter selection and comparison with behavior policy. What were the end results? submitted by /u/ZIGGY-Zz [link] [comments]  ( 9 min )
    What is best way to get RL agent to generalize across different versions of the same environment?
    E.g. imagine a gridworld where agent has to go to a goal space. I want it to be able to do this across many different types of levels but where task is same: "go to goal." Right now I use parallel envs for PPO and train simultaneously on all version environments. It worked for 2 very small levels but a bit slow, so I wanted to confirm this was best approach (e.g. vs sequential learning or curriculum learning or something completely different). I tried googling but can't find info on it for some reason. I did see the parallel env approach with domain randomization in a paper, but they don't discuss it much. submitted by /u/rl_ninja_rl_ninja [link] [comments]  ( 9 min )
    Ask for help with code
    Hey everyone, I am working on a project in the field of MARL and it would be a great help if anyone could help me, I am looking for implementations of the following environments: Decentralized Tiger Decentralized Rock Sampling Decentralized Box Pushing ​ I prefer the implementations to be in python and if they could follow the MultiAgentEnv interface created by the pymarl framework that would be perfect! thank you submitted by /u/yoyo-master [link] [comments]  ( 9 min )
  • Open

    Responsible AI at Google Research: Context in AI Research (CAIR)
    Posted by Katherine Heller, Research Scientist, Google Research, on behalf of the CAIR Team Artificial intelligence (AI) and related machine learning (ML) technologies are increasingly influential in the world around us, making it imperative that we consider the potential impacts on society and individuals in all aspects of the technology that we create. To these ends, the Context in AI Research (CAIR) team develops novel AI methods in the context of the entire AI pipeline: from data to end-user feedback. The pipeline for building an AI system typically starts with data collection, followed by designing a model to run on that data, deployment of the model in the real world, and lastly, compiling and incorporation of human feedback. Originating in the health space, and now expanded to a…  ( 93 min )
    Overcoming leakage on error-corrected quantum processors
    Posted by Kevin Miao and Matt McEwen, Research Scientists, Quantum AI Team The qubits that make up Google quantum devices are delicate and noisy, so it’s necessary to incorporate error correction procedures that identify and account for qubit errors on the way to building a useful quantum computer. Two of the most prevalent error mechanisms are bit-flip errors (where the energy state of the qubit changes) and phase-flip errors (where the phase of the encoded quantum information changes). Quantum error correction (QEC) promises to address and mitigate these two prominent errors. However, there is an assortment of other error mechanisms that challenges the effectiveness of QEC. While we want qubits to behave as ideal two-level systems with no loss mechanisms, this is not the case in r…  ( 94 min )
  • Open

    Promote pipelines in a multi-environment setup using Amazon SageMaker Model Registry, HashiCorp Terraform, GitHub, and Jenkins CI/CD
    Building out a machine learning operations (MLOps) platform in the rapidly evolving landscape of artificial intelligence (AI) and machine learning (ML) for organizations is essential for seamlessly bridging the gap between data science experimentation and deployment while meeting the requirements around model performance, security, and compliance. In order to fulfill regulatory and compliance requirements, the […]  ( 17 min )
    Customizing coding companions for organizations
    Generative AI models for coding companions are mostly trained on publicly available source code and natural language text. While the large size of the training corpus enables the models to generate code for commonly used functionality, these models are unaware of code in private repositories and the associated coding styles that are enforced when developing […]  ( 11 min )
  • Open

    Enter a World of Samurai and Demons: GFN Thursday Brings Capcom’s ‘Onimusha: Warlords’ to the Cloud
    Wield the blade and embrace the way of the samurai for some thrilling action — Onimusha: Warlords comes to GeForce NOW this week. Members can experience feudal Japan in this hack-and-slash adventure game in the cloud. It’s part of an action-packed GFN Thursday, with 16 more games joining the cloud gaming platform’s library. Forging Destinies Read article >  ( 5 min )
    Enter a World of Samurai and Demons: GFN Thursday Brings Capcom’s ‘Onimusha: Warlords’ to the Cloud
    Wield the blade and embrace the way of the samurai for some thrilling action — Onimusha: Warlords comes to GeForce NOW this week. Members can experience feudal Japan in this hack-and-slash adventure game in the cloud. It’s part of an action-packed GFN Thursday, with 16 more games joining the cloud gaming platform’s library. Forging Destinies Read article >  ( 5 min )
  • Open

    OpenAI Data Partnerships
    Working together to create open-source and private datasets for AI training.  ( 2 min )
  • Open

    Explained: Generative AI
    How do powerful generative AI systems like ChatGPT work, and what makes them different from other types of artificial intelligence?  ( 11 min )
  • Open

    Flow-based distributionally robust optimization. (arXiv:2310.19253v2 [cs.LG] UPDATED)
    We present a computationally efficient framework, called FlowDRO, for solving flow-based distributionally robust optimization (DRO) problems with Wasserstein uncertainty sets while aiming to find continuous worst-case distribution (also called the Least Favorable Distribution, LFD). The requirement for LFD to be continuous is so that the algorithm can be scalable to problems with larger sample sizes and achieve better generalization capability for the induced robust algorithms. To tackle the computationally challenging infinitely dimensional optimization problem, we leverage flow-based models and continuous-time invertible transport maps between the data distribution and the target distribution. We also develop a Wasserstein proximal gradient flow type of algorithm. In theory, we establish the equivalence of the solution by optimal transport map to the original formulation, as well as the dual form of the problem through Wasserstein calculus and Brenier theorem. In practice, we parameterize the transport maps by a sequence of neural networks progressively trained in blocks by gradient descent. Our computational framework is general, can handle high-dimensional data with large sample sizes, and can be useful for various applications. We demonstrate its usage in adversarial learning, distributionally robust hypothesis testing, and a new mechanism for data-driven distribution perturbation differential privacy, where the proposed method gives strong empirical performance on real high-dimensional data.  ( 2 min )
    Score-based Source Separation with Applications to Digital Communication Signals. (arXiv:2306.14411v2 [cs.LG] UPDATED)
    We propose a new method for separating superimposed sources using diffusion-based generative models. Our method relies only on separately trained statistical priors of independent sources to establish a new objective function guided by maximum a posteriori estimation with an $\alpha$-posterior, across multiple levels of Gaussian smoothing. Motivated by applications in radio-frequency (RF) systems, we are interested in sources with underlying discrete nature and the recovery of encoded bits from a signal of interest, as measured by the bit error rate (BER). Experimental results with RF mixtures demonstrate that our method results in a BER reduction of 95% over classical and existing learning-based methods. Our analysis demonstrates that our proposed method yields solutions that asymptotically approach the modes of an underlying discrete distribution. Furthermore, our method can be viewed as a multi-source extension to the recently proposed score distillation sampling scheme, shedding additional light on its use beyond conditional sampling. The project webpage is available at https://alpha-rgs.github.io  ( 2 min )
    A Mobile Data-Driven Hierarchical Deep Reinforcement Learning Approach for Real-time Demand-Responsive Railway Rescheduling and Station Overcrowding Mitigation. (arXiv:2308.11849v2 [eess.SY] UPDATED)
    Real-time railway rescheduling is an important technique to enable operational recovery in response to unexpected and dynamic conditions in a timely and flexible manner. Current research relies mostly on OD based data and model-based methods for estimating train passenger demands. These approaches primarily focus on averaged disruption patterns, often overlooking the immediate uneven distribution of demand over time. In reality, passenger demand deviates significantly from predictions, especially during a disaster. Disastrous situations such as flood in Zhengzhou, China in 2022 has created not only unprecedented effect on Zhengzhou railway station itself, which is a major railway hub in China, but also other major hubs connected to Zhengzhou, e.g., Xi'an, the closest hub west of Zhengzhou. In this study, we define a real-time demand-responsive (RTDR) railway rescheduling problem focusing two specific aspects, namely, volatility of the demand, and management of station crowdedness. For the first time, we propose a data-driven approach using real-time mobile data (MD) to deal with this RTDR problem. A hierarchical deep reinforcement learning (HDRL) framework is designed to perform real-time rescheduling in a demand-responsive manner. The use of MD has enabled the modelling of passenger dynamics in response to train delays and station crowdedness, and a real-time optimisation for rescheduling of train services in view of the change in demand as a result of passengers' behavioural response to disruption. Results show that the agent can steadily satisfy over 62% of the demand with only 61% of the original rolling stock, ensuring continuous operations without overcrowding. Moreover, the agent exhibits adaptability when transferred to a new environment with increased demand, highlighting its effectiveness in addressing unforeseen disruptions in real-time settings.  ( 3 min )
    S-LoRA: Serving Thousands of Concurrent LoRA Adapters. (arXiv:2311.03285v2 [cs.LG] UPDATED)
    The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at https://github.com/S-LoRA/S-LoRA  ( 3 min )
    MobileNVC: Real-time 1080p Neural Video Compression on a Mobile Device. (arXiv:2310.01258v2 [eess.IV] UPDATED)
    Neural video codecs have recently become competitive with standard codecs such as HEVC in the low-delay setting. However, most neural codecs are large floating-point networks that use pixel-dense warping operations for temporal modeling, making them too computationally expensive for deployment on mobile devices. Recent work has demonstrated that running a neural decoder in real time on mobile is feasible, but shows this only for 720p RGB video. This work presents the first neural video codec that decodes 1080p YUV420 video in real time on a mobile device. Our codec relies on two major contributions. First, we design an efficient codec that uses a block-based motion compensation algorithm available on the warping core of the mobile accelerator, and we show how to quantize this model to integer precision. Second, we implement a fast decoder pipeline that concurrently runs neural network components on the neural signal processor, parallel entropy coding on the mobile GPU, and warping on the warping core. Our codec outperforms the previous on-device codec by a large margin with up to 48% BD-rate savings, while reducing the MAC count on the receiver side by $10 \times$. We perform a careful ablation to demonstrate the effect of the introduced motion compensation scheme, and ablate the effect of model quantization.  ( 3 min )
    The Evolution of the Interplay Between Input Distributions and Linear Regions in Networks. (arXiv:2310.18725v2 [cs.LG] UPDATED)
    It is commonly recognized that the expressiveness of deep neural networks is contingent upon a range of factors, encompassing their depth, width, and other relevant considerations. Currently, the practical performance of the majority of deep neural networks remains uncertain. For ReLU (Rectified Linear Unit) networks with piecewise linear activations, the number of linear convex regions serves as a natural metric to gauge the network's expressivity. In this paper, we count the number of linear convex regions in deep neural networks based on ReLU. In particular, we prove that for any one-dimensional input, there exists a minimum threshold for the number of neurons required to express it. We also empirically observe that for the same network, intricate inputs hinder its capacity to express linear regions. Furthermore, we unveil the iterative refinement process of decision boundaries in ReLU networks during training. We aspire for our research to serve as an inspiration for network optimization endeavors and aids in the exploration and analysis of the behaviors exhibited by deep networks.  ( 2 min )
    Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. (arXiv:2310.19102v2 [cs.LG] UPDATED)
    The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to $7.73\times$ compared to the FP16 and by $2.53\times$ compared to INT8 quantization, while maintaining the same latency target.  ( 2 min )
    User Training with Error Augmentation for Electromyogram-based Gesture Classification. (arXiv:2309.07289v2 [cs.HC] UPDATED)
    We designed and tested a system for real-time control of a user interface by extracting surface electromyographic (sEMG) activity from eight electrodes in a wrist-band configuration. sEMG data were streamed into a machine-learning algorithm that classified hand gestures in real-time. After an initial model calibration, participants were presented with one of three types of feedback during a human-learning stage: veridical feedback, in which predicted probabilities from the gesture classification algorithm were displayed without alteration, modified feedback, in which we applied a hidden augmentation of error to these probabilities, and no feedback. User performance was then evaluated in a series of minigames, in which subjects were required to use eight gestures to manipulate their game avatar to complete a task. Experimental results indicated that, relative to baseline, the modified feedback condition led to significantly improved accuracy and improved gesture class separation. These findings suggest that real-time feedback in a gamified user interface with manipulation of feedback may enable intuitive, rapid, and accurate task acquisition for sEMG-based gesture recognition applications.  ( 2 min )
    DGFN: Double Generative Flow Networks. (arXiv:2310.19685v3 [cs.LG] UPDATED)
    Deep learning is emerging as an effective tool in drug discovery, with potential applications in both predictive and generative models. Generative Flow Networks (GFlowNets/GFNs) are a recently introduced method recognized for the ability to generate diverse candidates, in particular in small molecule generation tasks. In this work, we introduce double GFlowNets (DGFNs). Drawing inspiration from reinforcement learning and Double Deep Q-Learning, we introduce a target network used to sample trajectories, while updating the main network with these sampled trajectories. Empirical results confirm that DGFNs effectively enhance exploration in sparse reward domains and high-dimensional state spaces, both challenging aspects of de-novo design in drug discovery.  ( 2 min )
    Personalizing Keyword Spotting with Speaker Information. (arXiv:2311.03419v1 [eess.AS])
    Keyword spotting systems often struggle to generalize to a diverse population with various accents and age groups. To address this challenge, we propose a novel approach that integrates speaker information into keyword spotting using Feature-wise Linear Modulation (FiLM), a recent method for learning from multiple sources of information. We explore both Text-Dependent and Text-Independent speaker recognition systems to extract speaker information, and we experiment on extracting this information from both the input audio and pre-enrolled user audio. We evaluate our systems on a diverse dataset and achieve a substantial improvement in keyword detection accuracy, particularly among underrepresented speaker groups. Moreover, our proposed approach only requires a small 1% increase in the number of parameters, with a minimum impact on latency and computational cost, which makes it a practical solution for real-world applications.  ( 2 min )
    Maximum Mean Discrepancy Meets Neural Networks: The Radon-Kolmogorov-Smirnov Test. (arXiv:2309.02422v3 [stat.ML] UPDATED)
    Maximum mean discrepancy (MMD) refers to a general class of nonparametric two-sample tests that are based on maximizing the mean difference over samples from one distribution $P$ versus another $Q$, over all choices of data transformations $f$ living in some function space $\mathcal{F}$. Inspired by recent work that connects what are known as functions of $\textit{Radon bounded variation}$ (RBV) and neural networks (Parhi and Nowak, 2021, 2023), we study the MMD defined by taking $\mathcal{F}$ to be the unit ball in the RBV space of a given smoothness order $k \geq 0$. This test, which we refer to as the $\textit{Radon-Kolmogorov-Smirnov}$ (RKS) test, can be viewed as a generalization of the well-known and classical Kolmogorov-Smirnov (KS) test to multiple dimensions and higher orders of smoothness. It is also intimately connected to neural networks: we prove that the witness in the RKS test -- the function $f$ achieving the maximum mean difference -- is always a ridge spline of degree $k$, i.e., a single neuron in a neural network. This allows us to leverage the power of modern deep learning toolkits to (approximately) optimize the criterion that underlies the RKS test. We prove that the RKS test has asymptotically full power at distinguishing any distinct pair $P \not= Q$ of distributions, derive its asymptotic null distribution, and carry out extensive experiments to elucidate the strengths and weakenesses of the RKS test versus the more traditional kernel MMD test.  ( 3 min )
    How to Scale Your EMA. (arXiv:2307.13813v3 [stat.ML] UPDATED)
    Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6$\times$ wall-clock time reduction under idealized hardware settings.  ( 3 min )
    Dynamics of Finite Width Kernel and Prediction Fluctuations in Mean Field Neural Networks. (arXiv:2304.03408v3 [stat.ML] UPDATED)
    We analyze the dynamics of finite width effects in wide but finite feature learning neural networks. Starting from a dynamical mean field theory description of infinite width deep neural network kernel and prediction dynamics, we provide a characterization of the $O(1/\sqrt{\text{width}})$ fluctuations of the DMFT order parameters over random initializations of the network weights. Our results, while perturbative in width, unlike prior analyses, are non-perturbative in the strength of feature learning. In the lazy limit of network training, all kernels are random but static in time and the prediction variance has a universal form. However, in the rich, feature learning regime, the fluctuations of the kernels and predictions are dynamically coupled with a variance that can be computed self-consistently. In two layer networks, we show how feature learning can dynamically reduce the variance of the final tangent kernel and final network predictions. We also show how initialization variance can slow down online learning in wide but finite networks. In deeper networks, kernel variance can dramatically accumulate through subsequent layers at large feature learning strengths, but feature learning continues to improve the signal-to-noise ratio of the feature kernels. In discrete time, we demonstrate that large learning rate phenomena such as edge of stability effects can be well captured by infinite width dynamics and that initialization variance can decrease dynamically. For CNNs trained on CIFAR-10, we empirically find significant corrections to both the bias and variance of network dynamics due to finite width.  ( 3 min )
    Interaction Measures, Partition Lattices and Kernel Tests for High-Order Interactions. (arXiv:2306.00904v3 [stat.ML] UPDATED)
    Models that rely solely on pairwise relationships often fail to capture the complete statistical structure of the complex multivariate data found in diverse domains, such as socio-economic, ecological, or biomedical systems. Non-trivial dependencies between groups of more than two variables can play a significant role in the analysis and modelling of such systems, yet extracting such high-order interactions from data remains challenging. Here, we introduce a hierarchy of $d$-order ($d \geq 2$) interaction measures, increasingly inclusive of possible factorisations of the joint probability distribution, and define non-parametric, kernel-based tests to establish systematically the statistical significance of $d$-order interactions. We also establish mathematical links with lattice theory, which elucidate the derivation of the interaction measures and their composite permutation tests; clarify the connection of simplicial complexes with kernel matrix centring; and provide a means to enhance computational efficiency. We illustrate our results numerically with validations on synthetic data, and through an application to neuroimaging data.  ( 2 min )
    FAIR4Cov: Fused Audio Instance and Representation for COVID-19 Detection. (arXiv:2204.10581v3 [cs.SD] UPDATED)
    Audio-based classification techniques on body sounds have long been studied to aid in the diagnosis of respiratory diseases. While most research is centered on the use of cough as the main biomarker, other body sounds also have the potential to detect respiratory diseases. Recent studies on COVID-19 have shown that breath and speech sounds, in addition to cough, correlate with the disease. Our study proposes Fused Audio Instance and Representation (FAIR) as a method for respiratory disease detection. FAIR relies on constructing a joint feature vector from various body sounds represented in waveform and spectrogram form. We conducted experiments on the use case of COVID-19 detection by combining waveform and spectrogram representation of body sounds. Our findings show that the use of self-attention to combine extracted features from cough, breath, and speech sounds leads to the best performance with an Area Under the Receiver Operating Characteristic Curve (AUC) score of 0.8658, a sensitivity of 0.8057, and a specificity of 0.7958. Compared to models trained solely on spectrograms or waveforms, the use of both representations results in an improved AUC score, demonstrating that combining spectrogram and waveform representation helps to enrich the extracted features and outperforms the models that use only one representation.  ( 3 min )
    The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks. (arXiv:2310.00496v2 [cs.CV] UPDATED)
    We introduce the Sparsity Roofline, a visual performance model for evaluating sparsity in neural networks. The Sparsity Roofline jointly models network accuracy, sparsity, and theoretical inference speedup. Our approach does not require implementing and benchmarking optimized kernels, and the theoretical speedup becomes equal to the actual speedup when the corresponding dense and sparse kernels are well-optimized. We achieve this through a novel analytical model for predicting sparse network performance, and validate the predicted speedup using several real-world computer vision architectures pruned across a range of sparsity patterns and degrees. We demonstrate the utility and ease-of-use of our model through two case studies: (1) we show how machine learning researchers can predict the performance of unimplemented or unoptimized block-structured sparsity patterns, and (2) we show how hardware designers can predict the performance implications of new sparsity patterns and sparse data formats in hardware. In both scenarios, the Sparsity Roofline helps performance experts identify sparsity regimes with the highest performance potential.  ( 2 min )
    Deep-learning-based decomposition of overlapping-sparse images: application at the vertex of neutrino interactions. (arXiv:2310.19695v2 [cs.CV] UPDATED)
    Image decomposition plays a crucial role in various computer vision tasks, enabling the analysis and manipulation of visual content at a fundamental level. Overlapping images, which occur when multiple objects or scenes partially occlude each other, pose unique challenges for decomposition algorithms. The task intensifies when working with sparse images, where the scarcity of meaningful information complicates the precise extraction of components. This paper presents a solution that leverages the power of deep learning to accurately extract individual objects within multi-dimensional overlapping-sparse images, with a direct application in high-energy physics with decomposition of overlaid elementary particles obtained from imaging detectors. In particular, the proposed approach tackles a highly complex yet unsolved problem: identifying and measuring independent particles at the vertex of neutrino interactions, where one expects to observe detector images with multiple indiscernible overlapping charged particles. By decomposing the image of the detector activity at the vertex through deep learning, it is possible to infer the kinematic parameters of the identified low-momentum particles - which otherwise would remain neglected - and enhance the reconstructed energy resolution of the neutrino event. We also present an additional step - that can be tuned directly on detector data - combining the above method with a fully-differentiable generative model to improve the image decomposition further and, consequently, the resolution of the measured parameters, achieving unprecedented results. This improvement is crucial for precisely measuring the parameters that govern neutrino flavour oscillations and searching for asymmetries between matter and antimatter.  ( 3 min )
    Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges. (arXiv:2311.03287v2 [cs.LG] UPDATED)
    While GPT-4V(ision) impressively models both visual and textual information simultaneously, it's hallucination behavior has not been systematically assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias and Interference Challenges in Visual Language Models (Bingo). This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference. Here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the text prompt is phrased or how the input image is presented. We identify a notable regional bias, whereby GPT-4V(ision) is better at interpreting Western images or images with English writing compared to images from other countries or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. We also identified similar biases and interference vulnerabilities with LLaVA and Bard. Our results characterize the hallucination challenges in GPT-4V(ision) and state-of-the-art visual-language models, and highlight the need for new solutions. The Bingo benchmark is available at https://github.com/gzcch/Bingo.  ( 3 min )
    Improved MDL Estimators Using Fiber Bundle of Local Exponential Families for Non-exponential Families. (arXiv:2311.03852v1 [cs.IT])
    Minimum Description Length (MDL) estimators, using two-part codes for universal coding, are analyzed. For general parametric families under certain regularity conditions, we introduce a two-part code whose regret is close to the minimax regret, where regret of a code with respect to a target family M is the difference between the code length of the code and the ideal code length achieved by an element in M. This is a generalization of the result for exponential families by Gr\"unwald. Our code is constructed by using an augmented structure of M with a bundle of local exponential families for data description, which is not needed for exponential families. This result gives a tight upper bound on risk and loss of the MDL estimators based on the theory introduced by Barron and Cover in 1991. Further, we show that we can apply the result to mixture families, which are a typical example of non-exponential families.  ( 2 min )
    Learning-Based Latency-Constrained Fronthaul Compression Optimization in C-RAN. (arXiv:2311.03899v1 [cs.NI])
    The evolution of wireless mobile networks towards cloudification, where Radio Access Network (RAN) functions can be hosted at either a central or distributed locations, offers many benefits like low cost deployment, higher capacity, and improved hardware utilization. Nevertheless, the flexibility in the functional deployment comes at the cost of stringent fronthaul (FH) capacity and latency requirements. One possible approach to deal with these rigorous constraints is to use FH compression techniques. To ensure that FH capacity and latency requirements are met, more FH compression is applied during high load, while less compression is applied during medium and low load to improve FH utilization and air interface performance. In this paper, a model-free deep reinforcement learning (DRL) based FH compression (DRL-FC) framework is proposed that dynamically controls FH compression through various configuration parameters such as modulation order, precoder granularity, and precoder weight quantization that affect both FH load and air interface performance. Simulation results show that DRL-FC exhibits significantly higher FH utilization (68.7% on average) and air interface throughput than a reference scheme (i.e. with no applied compression) across different FH load levels. At the same time, the proposed DRL-FC framework is able to meet the predefined FH latency constraints (in our case set to 260 $\mu$s) under various FH loads.  ( 2 min )
    CongFu: Conditional Graph Fusion for Drug Synergy Prediction. (arXiv:2305.14517v2 [cs.LG] UPDATED)
    Drug synergy, characterized by the amplified combined effect of multiple drugs, is critically important for optimizing therapeutic outcomes. Limited data on drug synergy, arising from the vast number of possible drug combinations and testing costs, motivate the need for predictive methods. In this work, we introduce CongFu, a novel Conditional Graph Fusion Layer, designed to predict drug synergy. CongFu employs an attention mechanism and a bottleneck to extract local graph contexts and conditionally fuse graph data within a global context. Its modular architecture enables flexible replacement of layer modules, including readouts and graph encoders, facilitating customization for diverse applications. To evaluate the performance of CongFu, we conduct comprehensive experiments on four datasets, encompassing three distinct setups for drug synergy prediction. CongFu achieves state-of-the-art results on 11 out of 12 benchmark datasets, demonstrating its ability to capture intricate patterns of drug synergy. Through ablation studies, we validate the significance of individual layer components, affirming their contributions to overall predictive performance. Finally, we propose an explainability strategy for elucidating the effect of drugs on genes. By addressing the challenge of predicting drug synergy in untested drug pairs and utilizing our proposed explainability approach, CongFu opens new avenues for optimizing drug combinations and advancing personalized medicine.
    MAGNet: Motif-Agnostic Generation of Molecules from Shapes. (arXiv:2305.19303v2 [physics.chem-ph] UPDATED)
    Recent advances in machine learning for molecules exhibit great potential for facilitating drug discovery from in silico predictions. Most models for molecule generation rely on the decomposition of molecules into frequently occurring substructures (motifs), from which they generate novel compounds. While motif representations greatly aid in learning molecular distributions, such methods struggle to represent substructures beyond their known motif set. To alleviate this issue and increase flexibility across datasets, we propose MAGNet, a graph-based model that generates abstract shapes before allocating atom and bond types. To this end, we introduce a novel factorisation of the molecules' data distribution that accounts for the molecules' global context and facilitates learning adequate assignments of atoms and bonds onto shapes. Despite the added complexity of shape abstractions, MAGNet outperforms most other graph-based approaches on standard benchmarks. Importantly, we demonstrate that MAGNet's improved expressivity leads to molecules with more topologically distinct structures and, at the same time, diverse atom and bond assignments.
    Analysis of the User Perception of Chatbots in Education Using A Partial Least Squares Structural Equation Modeling Approach. (arXiv:2311.03636v1 [cs.HC])
    The integration of Artificial Intelligence (AI) into education is a recent development, with chatbots emerging as a noteworthy addition to this transformative landscape. As online learning platforms rapidly advance, students need to adapt swiftly to excel in this dynamic environment. Consequently, understanding the acceptance of chatbots, particularly those employing Large Language Model (LLM) such as Chat Generative Pretrained Transformer (ChatGPT), Google Bard, and other interactive AI technologies, is of paramount importance. However, existing research on chatbots in education has overlooked key behavior-related aspects, such as Optimism, Innovativeness, Discomfort, Insecurity, Transparency, Ethics, Interaction, Engagement, and Accuracy, creating a significant literature gap. To address this gap, this study employs Partial Least Squares Structural Equation Modeling (PLS-SEM) to investigate the determinant of chatbots adoption in education among students, considering the Technology Readiness Index (TRI) and Technology Acceptance Model (TAM). Utilizing a five-point Likert scale for data collection, we gathered a total of 185 responses, which were analyzed using R-Studio software. We established 12 hypotheses to achieve its objectives. The results showed that Optimism and Innovativeness are positively associated with Perceived Ease of Use (PEOU) and Perceived Usefulness (PU). Conversely, Discomfort and Insecurity negatively impact PEOU, with only Insecurity negatively affecting PU. These findings provide insights for future technology designers, elucidating critical user behavior factors influencing chatbots adoption and utilization in educational contexts.
    Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection. (arXiv:2303.10093v2 [cs.CV] UPDATED)
    Vision-language alignment learned from image-caption pairs has been shown to benefit tasks like object recognition and detection. Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context that should be considered when learning object alignment. It is unclear how methods use this context in learning, as well as whether models succeed when tasks require attribute and object understanding. To address this gap, we conduct extensive analysis of the role of attributes in vision-language models. We specifically measure model sensitivity to the presence and meaning of attribute context, gauging influence on object embeddings through unsupervised phrase grounding and classification via description methods. We further evaluate the utility of attribute context in training for open-vocabulary object detection, fine-grained text-region retrieval, and attribution tasks. Our results show that attribute context can be wasted when learning alignment for detection, attribute meaning is not adequately considered in embeddings, and describing classes by only their attributes is ineffective. A viable strategy that we find to increase benefits from attributes is contrastive training with adjective-based negative captions.
    How Many Neurons Does it Take to Approximate the Maximum?. (arXiv:2307.09212v2 [cs.LG] UPDATED)
    We study the size of a neural network needed to approximate the maximum function over $d$ inputs, in the most basic setting of approximating with respect to the $L_2$ norm, for continuous distributions, for a network that uses ReLU activations. We provide new lower and upper bounds on the width required for approximation across various depths. Our results establish new depth separations between depth 2 and 3, and depth 3 and 5 networks, as well as providing a depth $\mathcal{O}(\log(\log(d)))$ and width $\mathcal{O}(d)$ construction which approximates the maximum function. Our depth separation results are facilitated by a new lower bound for depth 2 networks approximating the maximum function over the uniform distribution, assuming an exponential upper bound on the size of the weights. Furthermore, we are able to use this depth 2 lower bound to provide tight bounds on the number of neurons needed to approximate the maximum by a depth 3 network. Our lower bounds are of potentially broad interest as they apply to the widely studied and used \emph{max} function, in contrast to many previous results that base their bounds on specially constructed or pathological functions and distributions.
    Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Methods. (arXiv:2310.05309v2 [cs.LG] UPDATED)
    Deep Neural Networks and Reinforcement Learning methods have empirically shown great promise in tackling challenging combinatorial problems. In those methods a deep neural network is used as a solution generator which is then trained by gradient-based methods (e.g., policy gradient) to successively obtain better solution distributions. In this work we introduce a novel theoretical framework for analyzing the effectiveness of such methods. We ask whether there exist generative models that (i) are expressive enough to generate approximately optimal solutions; (ii) have a tractable, i.e, polynomial in the size of the input, number of parameters; (iii) their optimization landscape is benign in the sense that it does not contain sub-optimal stationary points. Our main contribution is a positive answer to this question. Our result holds for a broad class of combinatorial problems including Max- and Min-Cut, Max-$k$-CSP, Maximum-Weight-Bipartite-Matching, and the Traveling Salesman Problem. As a byproduct of our analysis we introduce a novel regularization process over vanilla gradient descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.
    Minimum Width for Deep, Narrow MLP: A Diffeomorphism Approach. (arXiv:2308.15873v2 [cs.LG] UPDATED)
    Recently, there has been a growing focus on determining the minimum width requirements for achieving the universal approximation property in deep, narrow Multi-Layer Perceptrons (MLPs). Among these challenges, one particularly challenging task is approximating a continuous function under the uniform norm, as indicated by the significant disparity between its lower and upper bounds. To address this problem, we propose a framework that simplifies finding the minimum width for deep, narrow MLPs into determining a purely geometrical function denoted as $w(d_x, d_y)$. This function relies solely on the input and output dimensions, represented as $d_x$ and $d_y$, respectively. Two key steps support this framework. First, we demonstrate that deep, narrow MLPs, when provided with a small additional width, can approximate a $C^2$-diffeomorphism. Subsequently, using this result, we prove that $w(d_x, d_y)$ equates to the optimal minimum width required for deep, narrow MLPs to achieve universality. By employing the aforementioned framework and the Whitney embedding theorem, we provide an upper bound for the minimum width, given by $\operatorname{max}(2d_x+1, d_y) + \alpha(\sigma)$, where $0 \leq \alpha(\sigma) \leq 2$ represents a constant depending on the activation function. Furthermore, we provide a lower bound of $4$ for the minimum width in cases where the input and output dimensions are both equal to two.
    Teaching Language Models to Hallucinate Less with Synthetic Tasks. (arXiv:2310.06827v3 [cs.CL] UPDATED)
    Large language models (LLMs) frequently hallucinate on abstractive summarization tasks such as document-based question-answering, meeting summarization, and clinical report generation, even though all necessary information is included in context. However, optimizing LLMs to hallucinate less on these tasks is challenging, as hallucination is hard to efficiently evaluate at each optimization step. In this work, we show that reducing hallucination on a synthetic task can also reduce hallucination on real-world downstream tasks. Our method, SynTra, first designs a synthetic task where hallucinations are easy to elicit and measure. It next optimizes the LLM's system message via prefix-tuning on the synthetic task, and finally transfers the system message to realistic, hard-to-optimize tasks. Across three realistic abstractive summarization tasks, SynTra reduces hallucination for two 13B-parameter LLMs using only a synthetic retrieval task for supervision. We also find that optimizing the system message rather than the model weights can be critical; fine-tuning the entire model on the synthetic task can counterintuitively increase hallucination. Overall, SynTra demonstrates that the extra flexibility of working with synthetic data can help mitigate undesired behaviors in practice.
    FAIRO: Fairness-aware Adaptation in Sequential-Decision Making for Human-in-the-Loop Systems. (arXiv:2307.05857v2 [cs.LG] UPDATED)
    Achieving fairness in sequential-decision making systems within Human-in-the-Loop (HITL) environments is a critical concern, especially when multiple humans with different behavior and expectations are affected by the same adaptation decisions in the system. This human variability factor adds more complexity since policies deemed fair at one point in time may become discriminatory over time due to variations in human preferences resulting from inter- and intra-human variability. This paper addresses the fairness problem from an equity lens, considering human behavior variability, and the changes in human preferences over time. We propose FAIRO, a novel algorithm for fairness-aware sequential-decision making in HITL adaptation, which incorporates these notions into the decision-making process. In particular, FAIRO decomposes this complex fairness task into adaptive sub-tasks based on individual human preferences through leveraging the Options reinforcement learning framework. We design FAIRO to generalize to three types of HITL application setups that have the shared adaptation decision problem. Furthermore, we recognize that fairness-aware policies can sometimes conflict with the application's utility. To address this challenge, we provide a fairness-utility tradeoff in FAIRO, allowing system designers to balance the objectives of fairness and utility based on specific application requirements. Extensive evaluations of FAIRO on the three HITL applications demonstrate its generalizability and effectiveness in promoting fairness while accounting for human variability. On average, FAIRO can improve fairness compared with other methods across all three applications by 35.36%.
    Leveraging Deep Learning for Abstractive Code Summarization of Unofficial Documentation. (arXiv:2310.15015v2 [cs.SE] UPDATED)
    Usually, programming languages have official documentation to guide developers with APIs, methods, and classes. However, researchers identified insufficient or inadequate documentation examples and flaws with the API's complex structure as barriers to learning an API. As a result, developers may consult other sources (StackOverflow, GitHub, etc.) to learn more about an API. Recent research studies have shown that unofficial documentation is a valuable source of information for generating code summaries. We, therefore, have been motivated to leverage such a type of documentation along with deep learning techniques towards generating high-quality summaries for APIs discussed in informal documentation. This paper proposes an automatic approach using the BART algorithm, a state-of-the-art transformer model, to generate summaries for APIs discussed in StackOverflow. We built an oracle of human-generated summaries to evaluate our approach against it using ROUGE and BLEU metrics which are the most widely used evaluation metrics in text summarization. Furthermore, we evaluated our summaries empirically against a previous work in terms of quality. Our findings demonstrate that using deep learning algorithms can improve summaries' quality and outperform the previous work by an average of %57 for Precision, %66 for Recall, and %61 for F-measure, and it runs 4.4 times faster.
    ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion. (arXiv:2306.14770v2 [cs.LG] UPDATED)
    Prototype-based meta-learning has emerged as a powerful technique for addressing few-shot learning challenges. However, estimating a deterministic prototype using a simple average function from a limited number of examples remains a fragile process. To overcome this limitation, we introduce ProtoDiff, a novel framework that leverages a task-guided diffusion model during the meta-training phase to gradually generate prototypes, thereby providing efficient class representations. Specifically, a set of prototypes is optimized to achieve per-task prototype overfitting, enabling accurately obtaining the overfitted prototypes for individual tasks. Furthermore, we introduce a task-guided diffusion process within the prototype space, enabling the meta-learning of a generative process that transitions from a vanilla prototype to an overfitted prototype. ProtoDiff gradually generates task-specific prototypes from random noise during the meta-test stage, conditioned on the limited samples available for the new task. Furthermore, to expedite training and enhance ProtoDiff's performance, we propose the utilization of residual prototype learning, which leverages the sparsity of the residual prototype. We conduct thorough ablation studies to demonstrate its ability to accurately capture the underlying prototype distribution and enhance generalization. The new state-of-the-art performance on within-domain, cross-domain, and few-task few-shot classification further substantiates the benefit of ProtoDiff.
    Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization. (arXiv:2307.11620v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) has received considerable attention in recent years due to its attractive capability of learning policies from offline datasets without environmental interactions. Despite some success in the single-agent setting, offline multi-agent RL (MARL) remains to be a challenge. The large joint state-action space and the coupled multi-agent behaviors pose extra complexities for offline policy optimization. Most existing offline MARL studies simply apply offline data-related regularizations on individual agents, without fully considering the multi-agent system at the global level. In this work, we present OMIGA, a new offline m ulti-agent RL algorithm with implicit global-to-local v alue regularization. OMIGA provides a principled framework to convert global-level value regularization into equivalent implicit local value regularizations and simultaneously enables in-sample learning, thus elegantly bridging multi-agent value decomposition and policy learning with offline regularizations. Based on comprehensive experiments on the offline multi-agent MuJoCo and StarCraft II micro-management tasks, we show that OMIGA achieves superior performance over the state-of-the-art offline MARL methods in almost all tasks.
    Exact Bayesian Inference on Discrete Models via Probability Generating Functions: A Probabilistic Programming Approach. (arXiv:2305.17058v3 [cs.PL] UPDATED)
    We present an exact Bayesian inference method for discrete statistical models, which can find exact solutions to a large class of discrete inference problems, even with infinite support and continuous priors. To express such models, we introduce a probabilistic programming language that supports discrete and continuous sampling, discrete observations, affine functions, (stochastic) branching, and conditioning on discrete events. Our key tool is probability generating functions: they provide a compact closed-form representation of distributions that are definable by programs, thus enabling the exact computation of posterior probabilities, expectation, variance, and higher moments. Our inference method is provably correct and fully automated in a tool called Genfer, which uses automatic differentiation (specifically, Taylor polynomials), but does not require computer algebra. Our experiments show that Genfer is often faster than the existing exact inference tools PSI, Dice, and Prodigy. On a range of real-world inference problems that none of these exact tools can solve, Genfer's performance is competitive with approximate Monte Carlo methods, while avoiding approximation errors.
    Learning Proposals for Practical Energy-Based Regression. (arXiv:2110.11948v2 [cs.LG] UPDATED)
    Energy-based models (EBMs) have experienced a resurgence within machine learning in recent years, including as a promising alternative for probabilistic regression. However, energy-based regression requires a proposal distribution to be manually designed for training, and an initial estimate has to be provided at test-time. We address both of these issues by introducing a conceptually simple method to automatically learn an effective proposal distribution, which is parameterized by a separate network head. To this end, we derive a surprising result, leading to a unified training objective that jointly minimizes the KL divergence from the proposal to the EBM, and the negative log-likelihood of the EBM. At test-time, we can then employ importance sampling with the trained proposal to efficiently evaluate the learned EBM and produce stand-alone predictions. Furthermore, we utilize our derived training objective to learn mixture density networks (MDNs) with a jointly trained energy-based teacher, consistently outperforming conventional MDN training on four real-world regression tasks within computer vision. Code is available at https://github.com/fregu856/ebms_proposals.
    An enrichment approach for enhancing the expressivity of neural operators with applications to seismology. (arXiv:2306.04096v2 [cs.LG] UPDATED)
    The Eikonal equation plays a central role in seismic wave propagation and hypocenter localization, a crucial aspect of efficient earthquake early warning systems. Despite recent progress, real-time earthquake localization remains challenging due to the need to learn a generalizable Eikonal operator. We introduce a novel deep learning architecture, Enriched-DeepONet (En-DeepONet), addressing the limitations of current operator learning models in dealing with moving-solution operators. Leveraging addition and subtraction operations and a novel `root' network, En-DeepONet is particularly suitable for learning such operators and achieves up to four orders of magnitude improved accuracy without increased training cost. We demonstrate the effectiveness of En-DeepONet in earthquake localization under variable velocity and arrival time conditions. Our results indicate that En-DeepONet paves the way for real-time hypocenter localization for velocity models of practical interest. The proposed method represents a significant advancement in operator learning that is applicable to a gamut of scientific problems, including those in seismology, fracture mechanics, and phase-field problems.
    The Impact of Positional Encoding on Length Generalization in Transformers. (arXiv:2305.19466v2 [cs.CL] UPDATED)
    Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major factor influencing length generalization, but the exact impact of different PE schemes on extrapolation in downstream tasks remains unclear. In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches including Absolute Position Embedding (APE), T5's Relative PE, ALiBi, and Rotary, in addition to Transformers without positional encoding (NoPE). Our evaluation encompasses a battery of reasoning and mathematical tasks. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks. More importantly, NoPE outperforms other explicit positional encoding methods while requiring no additional computation. We theoretically demonstrate that NoPE can represent both absolute and relative PEs, but when trained with SGD, it mostly resembles T5's relative PE attention patterns. Finally, we find that scratchpad is not always helpful to solve length generalization and its format highly impacts the model's performance. Overall, our work suggests that explicit position embeddings are not essential for decoder-only Transformers to generalize well to longer sequences.
    Neuro-Symbolic Causal Reasoning Meets Signaling Game for Emergent Semantic Communications. (arXiv:2210.12040v2 [cs.LG] UPDATED)
    Semantic communication (SC) aims to communicate reliably with minimal data transfer while simultaneously providing seamless connectivity to heterogeneous services and users. In this paper, a novel emergent SC (ESC) system framework is proposed and is composed of a signaling game for emergent language design and a neuro-symbolic (NeSy) artificial intelligence (AI) approach for causal reasoning. In order to design the language, the signaling game is solved using an alternating maximization between the communicating node's utilities. The emergent language helps create a context-aware transmit vocabulary (minimal semantic representation) and aids the reasoning process (enabling generalization to unseen scenarios) by splitting complex messages into simpler reasoning tasks for the receiver. The causal description at the transmitter is then modeled (a neural component) as a posterior distribution of the relevant attributes present in the data. Using the reconstructed causal state, the receiver evaluates a set of logical formulas (symbolic part) to execute its task. The nodes NeSy reasoning components are implemented by the recently proposed AI tool called Generative Flow Networks, and they are optimized for higher semantic reliability. The ESC system is designed to enhance the novel metrics of semantic information, reliability, distortion and similarity that are designed using rigorous algebraic properties from category theory thereby generalizing the metrics beyond Shannon's notion of uncertainty. Simulation results validate the ability of ESC to communicate efficiently (with reduced bits) and achieve better semantic reliability than conventional wireless and state-of-the-art systems that do not exploit causal reasoning capabilities.
    The Fairness Stitch: Unveiling the Potential of Model Stitching in Neural Network De-Biasing. (arXiv:2311.03532v1 [cs.LG])
    The pursuit of fairness in machine learning models has emerged as a critical research challenge in different applications ranging from bank loan approval to face detection. Despite the widespread adoption of artificial intelligence algorithms across various domains, concerns persist regarding the presence of biases and discrimination within these models. To address this pressing issue, this study introduces a novel method called "The Fairness Stitch (TFS)" to enhance fairness in deep learning models. This method combines model stitching and training jointly, while incorporating fairness constraints. In this research, we assess the effectiveness of our proposed method by conducting a comprehensive evaluation of two well-known datasets, CelebA and UTKFace. We systematically compare the performance of our approach with the existing baseline method. Our findings reveal a notable improvement in achieving a balanced trade-off between fairness and performance, highlighting the promising potential of our method to address bias-related challenges and foster equitable outcomes in machine learning models. This paper poses a challenge to the conventional wisdom of the effectiveness of the last layer in deep learning models for de-biasing.
    Large Language Models as Superpositions of Cultural Perspectives. (arXiv:2307.07870v3 [cs.CL] UPDATED)
    Large Language Models (LLMs) are often misleadingly recognized as having a personality or a set of values. We argue that an LLM can be seen as a superposition of perspectives with different values and personality traits. LLMs exhibit context-dependent values and personality traits that change based on the induced perspective (as opposed to humans, who tend to have more coherent values and personality traits across contexts). We introduce the concept of perspective controllability, which refers to a model's affordance to adopt various perspectives with differing values and personality traits. In our experiments, we use questionnaires from psychology (PVQ, VSM, IPIP) to study how exhibited values and personality traits change based on different perspectives. Through qualitative experiments, we show that LLMs express different values when those are (implicitly or explicitly) implied in the prompt, and that LLMs express different values even when those are not obviously implied (demonstrating their context-dependent nature). We then conduct quantitative experiments to study the controllability of different models (GPT-4, GPT-3.5, OpenAssistant, StableVicuna, StableLM), the effectiveness of various methods for inducing perspectives, and the smoothness of the models' drivability. We conclude by examining the broader implications of our work and outline a variety of associated scientific questions. The project website is available at https://sites.google.com/view/llm-superpositions .
    Learning Probabilistic Symmetrization for Architecture Agnostic Equivariance. (arXiv:2306.02866v2 [cs.LG] UPDATED)
    We present a novel framework to overcome the limitations of equivariant architectures in learning functions with group symmetries. In contrary to equivariant architectures, we use an arbitrary base model such as an MLP or a transformer and symmetrize it to be equivariant to the given group by employing a small equivariant network that parameterizes the probabilistic distribution underlying the symmetrization. The distribution is end-to-end trained with the base model which can maximize performance while reducing sample complexity of symmetrization. We show that this approach ensures not only equivariance to given group but also universal approximation capability in expectation. We implement our method on various base models, including patch-based transformers that can be initialized from pretrained vision transformers, and test them for a wide range of symmetry groups including permutation and Euclidean groups and their combinations. Empirical tests show competitive results against tailored equivariant architectures, suggesting the potential for learning equivariant functions for diverse groups using a non-equivariant universal base architecture. We further show evidence of enhanced learning in symmetric modalities, like graphs, when pretrained from non-symmetric modalities, like vision. Code is available at https://github.com/jw9730/lps.
    A Physics-Guided Bi-Fidelity Fourier-Featured Operator Learning Framework for Predicting Time Evolution of Drag and Lift Coefficients. (arXiv:2311.03639v1 [cs.LG])
    In the pursuit of accurate experimental and computational data while minimizing effort, there is a constant need for high-fidelity results. However, achieving such results often requires significant computational resources. To address this challenge, this paper proposes a deep operator learning-based framework that requires a limited high-fidelity dataset for training. We introduce a novel physics-guided, bi-fidelity, Fourier-featured Deep Operator Network (DeepONet) framework that effectively combines low and high-fidelity datasets, leveraging the strengths of each. In our methodology, we began by designing a physics-guided Fourier-featured DeepONet, drawing inspiration from the intrinsic physical behavior of the target solution. Subsequently, we train this network to primarily learn the low-fidelity solution, utilizing an extensive dataset. This process ensures a comprehensive grasp of the foundational solution patterns. Following this foundational learning, the low-fidelity deep operator network's output is enhanced using a physics-guided Fourier-featured residual deep operator network. This network refines the initial low-fidelity output, achieving the high-fidelity solution by employing a small high-fidelity dataset for training. Notably, in our framework, we employ the Fourier feature network as the Trunk network for the DeepONets, given its proficiency in capturing and learning the oscillatory nature of the target solution with high precision. We validate our approach using a well-known 2D benchmark cylinder problem, which aims to predict the time trajectories of lift and drag coefficients. The results highlight that the physics-guided Fourier-featured deep operator network, serving as a foundational building block of our framework, possesses superior predictive capability for the lift and drag coefficients compared to its data-driven counterparts.  ( 3 min )
    PINNs error estimates for nonlinear equations in $\mathbb{R}$-smooth Banach spaces. (arXiv:2305.11915v2 [math.FA] UPDATED)
    In the paper, we describe in operator form classes of PDEs that admit PINN's error estimation. Also, for $L^p$ spaces, we obtain a Bramble-Hilbert type lemma that is a tool for PINN's residuals bounding.
    Learning Disentangled Speech Representations. (arXiv:2311.03389v1 [eess.AS])
    Disentangled representation learning from speech remains limited despite its importance in many application domains. A key challenge is the lack of speech datasets with known generative factors to evaluate methods. This paper proposes SynSpeech: a novel synthetic speech dataset with ground truth factors enabling research on disentangling speech representations. We plan to present a comprehensive study evaluating supervised techniques using established supervised disentanglement metrics. This benchmark dataset and framework address the gap in the rigorous evaluation of state-of-the-art disentangled speech representation learning methods. Our findings will provide insights to advance this underexplored area and enable more robust speech representations.  ( 2 min )
    Graph Construction using Principal Axis Trees for Simple Graph Convolution. (arXiv:2302.12000v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) are increasingly becoming the favorite method for graph learning. They exploit the semi-supervised nature of deep learning, and they bypass computational bottlenecks associated with traditional graph learning methods. In addition to the feature matrix $X$, GNNs need an adjacency matrix $A$ to perform feature propagation. In many cases, the adjacency matrix $A$ is missing. We introduce a graph construction scheme that constructs the adjacency matrix $A$ using unsupervised and supervised information. Unsupervised information characterizes the neighborhood around points. We used Principal Axis trees (PA-trees) as a source for unsupervised information, where we create edges between points falling onto the same leaf node. For supervised information, we used the concept of penalty and intrinsic graphs. A penalty graph connects points with different class labels, whereas an intrinsic graph connects points with the same class labels. We used the penalty and intrinsic graphs to remove or add edges to the graph constructed via PA-tree. We tested this graph construction scheme on two well-known GNNs: 1) Graph Convolutional Network (GCN) and 2) Simple Graph Convolution (SGC). The experiments show that it is better to use SGC because it is faster and delivers better or the same results as GCN. We also test the effect of oversmoothing on both GCN and SGC. We found out that the level of smoothing has to be carefully selected for SGC to avoid oversmoothing.  ( 3 min )
    Efficient Approximations of Complete Interatomic Potentials for Crystal Property Prediction. (arXiv:2306.10045v9 [physics.chem-ph] UPDATED)
    We study property prediction for crystal materials. A crystal structure consists of a minimal unit cell that is repeated infinitely in 3D space. How to accurately represent such repetitive structures in machine learning models remains unresolved. Current methods construct graphs by establishing edges only between nearby nodes, thereby failing to faithfully capture infinite repeating patterns and distant interatomic interactions. In this work, we propose several innovations to overcome these limitations. First, we propose to model physics-principled interatomic potentials directly instead of only using distances as in many existing methods. These potentials include the Coulomb potential, London dispersion potential, and Pauli repulsion potential. Second, we model the complete set of potentials among all atoms, instead of only between nearby atoms as in existing methods. This is enabled by our approximations of infinite potential summations, where we extend the Ewald summation for several potential series approximations with provable error bounds. Finally, we propose to incorporate our computations of complete interatomic potentials into message passing neural networks for representation learning. We perform experiments on the JARVIS and Materials Project benchmarks for evaluation. Results show that the use of interatomic potentials and complete interatomic potentials leads to consistent performance improvements with reasonable computational costs. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS/tree/main/OpenMat/PotNet).
    How Reliable is Your Regression Model's Uncertainty Under Real-World Distribution Shifts?. (arXiv:2302.03679v2 [cs.LG] UPDATED)
    Many important computer vision applications are naturally formulated as regression problems. Within medical imaging, accurate regression models have the potential to automate various tasks, helping to lower costs and improve patient outcomes. Such safety-critical deployment does however require reliable estimation of model uncertainty, also under the wide variety of distribution shifts that might be encountered in practice. Motivated by this, we set out to investigate the reliability of regression uncertainty estimation methods under various real-world distribution shifts. To that end, we propose an extensive benchmark of 8 image-based regression datasets with different types of challenging distribution shifts. We then employ our benchmark to evaluate many of the most common uncertainty estimation methods, as well as two state-of-the-art uncertainty scores from the task of out-of-distribution detection. We find that while methods are well calibrated when there is no distribution shift, they all become highly overconfident on many of the benchmark datasets. This uncovers important limitations of current uncertainty estimation methods, and the proposed benchmark therefore serves as a challenge to the research community. We hope that our benchmark will spur more work on how to develop truly reliable regression uncertainty estimation methods. Code is available at https://github.com/fregu856/regression_uncertainty.
    Online learning of long-range dependencies. (arXiv:2305.15947v2 [cs.LG] UPDATED)
    Online learning holds the promise of enabling efficient long-term credit assignment in recurrent neural networks. However, current algorithms fall short of offline backpropagation by either not being scalable or failing to learn long-range dependencies. Here we present a high-performance online learning algorithm that merely doubles the memory and computational requirements of a single inference pass. We achieve this by leveraging independent recurrent modules in multi-layer networks, an architectural motif that has recently been shown to be particularly powerful. Experiments on synthetic memory problems and on the challenging long-range arena benchmark suite reveal that our algorithm performs competitively, establishing a new standard for what can be achieved through online learning. This ability to learn long-range dependencies offers a new perspective on learning in the brain and opens a promising avenue in neuromorphic computing.
    ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning. (arXiv:2311.03721v1 [cs.LG])
    Climate models have been key for assessing the impact of climate change and simulating future climate scenarios. The machine learning (ML) community has taken an increased interest in supporting climate scientists' efforts on various tasks such as climate model emulation, downscaling, and prediction tasks. Many of those tasks have been addressed on datasets created with single climate models. However, both the climate science and ML communities have suggested that to address those tasks at scale, we need large, consistent, and ML-ready climate model datasets. Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the Input4MIPs and CMIP6 archives. In addition, we provide a modular dataset pipeline for retrieving and preprocessing additional climate models and scenarios. We showcase the potential of our dataset by using it as a benchmark for ML-based climate model emulation. We gain new insights about the performance and generalization capabilities of the different ML models by analyzing their performance across different climate models. Furthermore, the dataset can be used to train an ML emulator on several climate models instead of just one. Such a "super emulator" can quickly project new climate change scenarios, complementing existing scenarios already provided to policymakers. We believe ClimateSet will create the basis needed for the ML community to tackle climate-related tasks at scale.
    Task-Driven Detection of Distribution Shifts with Statistical Guarantees for Robot Learning. (arXiv:2106.13703v6 [cs.RO] UPDATED)
    Our goal is to perform out-of-distribution (OOD) detection, i.e., to detect when a robot is operating in environments drawn from a different distribution than the ones used to train the robot. We leverage Probably Approximately Correct (PAC)-Bayes theory to train a policy with a guaranteed bound on performance on the training distribution. Our idea for OOD detection relies on the following intuition: violation of the performance bound on test environments provides evidence that the robot is operating OOD. We formalize this via statistical techniques based on p-values and concentration inequalities. The approach provides guaranteed confidence bounds on OOD detection including bounds on both the false positive and false negative rates of the detector and is task-driven and only sensitive to changes that impact the robot's performance. We demonstrate our approach in simulation and hardware for a grasping task using objects with unfamiliar shapes or poses and a drone performing vision-based obstacle avoidance in environments with wind disturbances and varied obstacle densities. Our examples demonstrate that we can perform task-driven OOD detection within just a handful of trials.  ( 3 min )
    Dynamic Non-monotone Submodular Maximization. (arXiv:2311.03685v1 [cs.DS])
    Maximizing submodular functions has been increasingly used in many applications of machine learning, such as data summarization, recommendation systems, and feature selection. Moreover, there has been a growing interest in both submodular maximization and dynamic algorithms. In 2020, Monemizadeh and Lattanzi, Mitrovic, Norouzi{-}Fard, Tarnawski, and Zadimoghaddam initiated developing dynamic algorithms for the monotone submodular maximization problem under the cardinality constraint $k$. Recently, there have been some improvements on the topic made by Banihashem, Biabani, Goudarzi, Hajiaghayi, Jabbarzade, and Monemizadeh. In 2022, Chen and Peng studied the complexity of this problem and raised an important open question: "Can we extend [fully dynamic] results (algorithm or hardness) to non-monotone submodular maximization?". We affirmatively answer their question by demonstrating a reduction from maximizing a non-monotone submodular function under the cardinality constraint $k$ to maximizing a monotone submodular function under the same constraint. Through this reduction, we obtain the first dynamic algorithms to solve the non-monotone submodular maximization problem under the cardinality constraint $k$. Our algorithms maintain an $(8+\epsilon)$-approximate of the solution and use expected amortized $O(\epsilon^{-3}k^3\log^3(n)\log(k))$ or $O(\epsilon^{-1}k^2\log^3(k))$ oracle queries per update, respectively. Furthermore, we showcase the benefits of our dynamic algorithm for video summarization and max-cut problems on several real-world data sets.  ( 2 min )
    Are Words Enough? On the semantic conditioning of affective music generation. (arXiv:2311.03624v1 [cs.MM])
    Music has been commonly recognized as a means of expressing emotions. In this sense, an intense debate emerges from the need to verbalize musical emotions. This concern seems highly relevant today, considering the exponential growth of natural language processing using deep learning models where it is possible to prompt semantic propositions to generate music automatically. This scoping review aims to analyze and discuss the possibilities of music generation conditioned by emotions. To address this topic, we propose a historical perspective that encompasses the different disciplines and methods contributing to this topic. In detail, we review two main paradigms adopted in automatic music generation: rules-based and machine-learning models. Of note are the deep learning architectures that aim to generate high-fidelity music from textual descriptions. These models raise fundamental questions about the expressivity of music, including whether emotions can be represented with words or expressed through them. We conclude that overcoming the limitation and ambiguity of language to express emotions through music, some of the use of deep learning with natural language has the potential to impact the creative industries by providing powerful tools to prompt and generate new musical works.
    Plug-and-Play Stability for Intracortical Brain-Computer Interfaces: A One-Year Demonstration of Seamless Brain-to-Text Communication. (arXiv:2311.03611v1 [cs.HC])
    Intracortical brain-computer interfaces (iBCIs) have shown promise for restoring rapid communication to people with neurological disorders such as amyotrophic lateral sclerosis (ALS). However, to maintain high performance over time, iBCIs typically need frequent recalibration to combat changes in the neural recordings that accrue over days. This requires iBCI users to stop using the iBCI and engage in supervised data collection, making the iBCI system hard to use. In this paper, we propose a method that enables self-recalibration of communication iBCIs without interrupting the user. Our method leverages large language models (LMs) to automatically correct errors in iBCI outputs. The self-recalibration process uses these corrected outputs ("pseudo-labels") to continually update the iBCI decoder online. Over a period of more than one year (403 days), we evaluated our Continual Online Recalibration with Pseudo-labels (CORP) framework with one clinical trial participant. CORP achieved a stable decoding accuracy of 93.84% in an online handwriting iBCI task, significantly outperforming other baseline methods. Notably, this is the longest-running iBCI stability demonstration involving a human participant. Our results provide the first evidence for long-term stabilization of a plug-and-play, high-performance communication iBCI, addressing a major barrier for the clinical translation of iBCIs.
    Towards Accelerated Model Training via Bayesian Data Selection. (arXiv:2308.10544v3 [cs.LG] UPDATED)
    Mislabeled, duplicated, or biased data in real-world scenarios can lead to prolonged training and even hinder model convergence. Traditional solutions prioritizing easy or hard samples lack the flexibility to handle such a variety simultaneously. Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. However, its practical adoption relies on less principled approximations and additional holdout data. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models. The resulting algorithm is efficient and easy to implement. We perform extensive empirical studies on challenging benchmarks with considerable data noise and imbalance in the online batch selection scenario, and observe superior training efficiency over competitive baselines. Notably, on the challenging WebVision benchmark, our method can achieve similar predictive performance with significantly fewer training iterations than leading data selection methods.
    Innovation and Word Usage Patterns in Machine Learning. (arXiv:2311.03633v1 [cs.LG])
    In this study, we delve into the dynamic landscape of machine learning research evolution. Initially, through the utilization of Latent Dirichlet Allocation, we discern pivotal themes and fundamental concepts that have emerged within the realm of machine learning. Subsequently, we undertake a comprehensive analysis to track the evolutionary trajectories of these identified themes. To quantify the novelty and divergence of research contributions, we employ the Kullback-Leibler Divergence metric. This statistical measure serves as a proxy for ``surprise'', indicating the extent of differentiation between the content of academic papers and the subsequent developments in research. By amalgamating these insights, we gain the ability to ascertain the pivotal roles played by prominent researchers and the significance of specific academic venues (periodicals and conferences) within the machine learning domain.
    Loss Balancing for Fair Supervised Learning. (arXiv:2311.03714v1 [cs.LG])
    Supervised learning models have been used in various domains such as lending, college admission, face recognition, natural language processing, etc. However, they may inherit pre-existing biases from training data and exhibit discrimination against protected social groups. Various fairness notions have been proposed to address unfairness issues. In this work, we focus on Equalized Loss (EL), a fairness notion that requires the expected loss to be (approximately) equalized across different groups. Imposing EL on the learning process leads to a non-convex optimization problem even if the loss function is convex, and the existing fair learning algorithms cannot properly be adopted to find the fair predictor under the EL constraint. This paper introduces an algorithm that can leverage off-the-shelf convex programming tools (e.g., CVXPY) to efficiently find the global optimum of this non-convex optimization. In particular, we propose the ELminimizer algorithm, which finds the optimal fair predictor under EL by reducing the non-convex optimization to a sequence of convex optimization problems. We theoretically prove that our algorithm finds the global optimal solution under certain conditions. Then, we support our theoretical results through several empirical studies.
    The Future of Consumer Edge-AI Computing. (arXiv:2210.10514v2 [cs.LG] UPDATED)
    In the last decade, Deep Learning has rapidly infiltrated the consumer end, mainly thanks to hardware acceleration across devices. However, as we look towards the future, it is evident that isolated hardware will be insufficient. Increasingly complex AI tasks demand shared resources, cross-device collaboration, and multiple data types, all without compromising user privacy or quality of experience. To address this, we introduce a novel paradigm centered around EdgeAI-Hub devices, designed to reorganise and optimise compute resources and data access at the consumer edge. To this end, we lay a holistic foundation for the transition from on-device to Edge-AI serving systems in consumer environments, detailing their components, structure, challenges and opportunities.
    A Simple and Efficient Baseline for Data Attribution on Images. (arXiv:2311.03386v1 [cs.CV])
    Data attribution methods play a crucial role in understanding machine learning models, providing insight into which training data points are most responsible for model outputs during deployment. However, current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions. These approaches therefore come at a high computational cost, are memory intensive, and are hard to scale to large models or datasets. In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution. Our method is model-agnostic and scales easily to large datasets. We show results on CIFAR-10 and ImageNet, achieving strong performance that rivals or outperforms state-of-the-art approaches at a fraction of the compute or memory cost. Contrary to prior work, our results reinforce the intuition that a model's prediction on one image is most impacted by visually similar training samples. Our approach serves as a simple and efficient baseline for data attribution on images.  ( 2 min )
    Hypothesis Network Planned Exploration for Rapid Meta-Reinforcement Learning Adaptation. (arXiv:2311.03701v1 [cs.AI])
    Meta Reinforcement Learning (Meta RL) trains agents that adapt to fast-changing environments and tasks. Current strategies often lose adaption efficiency due to the passive nature of model exploration, causing delayed understanding of new transition dynamics. This results in particularly fast-evolving tasks being impossible to solve. We propose a novel approach, Hypothesis Network Planned Exploration (HyPE), that integrates an active and planned exploration process via the hypothesis network to optimize adaptation speed. HyPE uses a generative hypothesis network to form potential models of state transition dynamics, then eliminates incorrect models through strategically devised experiments. Evaluated on a symbolic version of the Alchemy game, HyPE outpaces baseline methods in adaptation speed and model accuracy, validating its potential in enhancing reinforcement learning adaptation in rapidly evolving settings.
    Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs. (arXiv:2311.03365v1 [cs.SE])
    In software development, code comments play a crucial role in enhancing code comprehension and collaboration. This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful." We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process. We address this task by incorporating generated code and comment pairs. The initial dataset comprised 9048 pairs of code and comments written in C, labeled as either Useful or Not Useful. To augment this dataset, we sourced an additional 739 lines of code-comment pairs and generated labels using a Large Language Model Architecture, specifically BERT. The primary objective was to build classification models that can effectively differentiate between useful and not useful code comments. Various machine learning algorithms were employed, including Logistic Regression, Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Gradient Boosting, Random Forest, and a Neural Network. Each algorithm was evaluated using precision, recall, and F1-score metrics, both with the original seed dataset and the augmented dataset. This study showcases the potential of generative AI for enhancing binary code comment quality classification models, providing valuable insights for software developers and researchers in the field of natural language processing and software engineering.  ( 3 min )
    Mobile Augmented Reality with Federated Learning in the Metaverse. (arXiv:2212.08324v2 [cs.LG] UPDATED)
    The Metaverse is deemed the next evolution of the Internet and has received much attention recently. Metaverse applications via mobile augmented reality (MAR) require rapid and accurate object detection to mix digital data with the real world. As mobile devices evolve, their computational capabilities are increasing, and thus their computational resources can be leveraged to train machine learning models. In light of the increasing concerns of user privacy and data security, federated learning (FL) has become a promising distributed learning framework for privacy-preserving analytics. In this article, FL and MAR are brought together in the Metaverse. We discuss the necessity and rationality of the combination of FL and MAR. The prospective technologies that support FL and MAR in the Metaverse are also discussed. In addition, existing challenges that prevent the fulfillment of FL and MAR in the Metaverse and several application scenarios are presented. Finally, three case studies of Metaverse FL-MAR systems are demonstrated.
    Separating and Learning Latent Confounders to Enhancing User Preferences Modeling. (arXiv:2311.03381v1 [cs.IR])
    Recommender models aim to capture user preferences from historical feedback and then predict user-specific feedback on candidate items. However, the presence of various unmeasured confounders causes deviations between the user preferences in the historical feedback and the true preferences, resulting in models not meeting their expected performance. Existing debias models either (1) specific to solving one particular bias or (2) directly obtain auxiliary information from user historical feedback, which cannot identify whether the learned preferences are true user preferences or mixed with unmeasured confounders. Moreover, we find that the former recommender system is not only a successor to unmeasured confounders but also acts as an unmeasured confounder affecting user preference modeling, which has always been neglected in previous studies. To this end, we incorporate the effect of the former recommender system and treat it as a proxy for all unmeasured confounders. We propose a novel framework, \textbf{S}eparating and \textbf{L}earning Latent Confounders \textbf{F}or \textbf{R}ecommendation (\textbf{SLFR}), which obtains the representation of unmeasured confounders to identify the counterfactual feedback by disentangling user preferences and unmeasured confounders, then guides the target model to capture the true preferences of users. Extensive experiments in five real-world datasets validate the advantages of our method.
    OpenGSL: A Comprehensive Benchmark for Graph Structure Learning. (arXiv:2306.10280v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have emerged as the de facto standard for representation learning on graphs, owing to their ability to effectively integrate graph topology and node attributes. However, the inherent suboptimal nature of node connections, resulting from the complex and contingent formation process of graphs, presents significant challenges in modeling them effectively. To tackle this issue, Graph Structure Learning (GSL), a family of data-centric learning approaches, has garnered substantial attention in recent years. The core concept behind GSL is to jointly optimize the graph structure and the corresponding GNN models. Despite the proposal of numerous GSL methods, the progress in this field remains unclear due to inconsistent experimental protocols, including variations in datasets, data processing techniques, and splitting strategies. In this paper, we introduce OpenGSL, the first comprehensive benchmark for GSL, aimed at addressing this gap. OpenGSL enables a fair comparison among state-of-the-art GSL methods by evaluating them across various popular datasets using uniform data processing and splitting strategies. Through extensive experiments, we observe that existing GSL methods do not consistently outperform vanilla GNN counterparts. We also find that there is no significant correlation between the homophily of the learned structure and task performance, challenging the common belief. Moreover, we observe that the learned graph structure demonstrates a strong generalization ability across different GNN models, despite the high computational and space consumption. We hope that our open-sourced library will facilitate rapid and equitable evaluation and inspire further innovative research in this field. The code of the benchmark can be found in https://github.com/OpenGSL/OpenGSL.
    Explicit Planning Helps Language Models in Logical Reasoning. (arXiv:2303.15714v4 [cs.CL] UPDATED)
    Language models have been shown to perform remarkably well on a wide range of natural language processing tasks. In this paper, we propose LEAP, a novel system that uses language models to perform multi-step logical reasoning and incorporates explicit planning into the inference procedure. Explicit planning enables the system to make more informed reasoning decisions at each step by looking ahead into their future effects. Moreover, we propose a training strategy that safeguards the planning process from being led astray by spurious features. Our full system significantly outperforms other competing methods on multiple standard datasets. When using small T5 models as its core selection and deduction components, our system performs competitively compared to GPT-3 despite having only about 1B parameters (i.e., 175 times smaller than GPT-3). When using GPT-3.5, it significantly outperforms chain-of-thought prompting on the challenging PrOntoQA dataset. We have conducted extensive empirical studies to demonstrate that explicit planning plays a crucial role in the system's performance.
    Manifold learning: what, how, and why. (arXiv:2311.03757v1 [stat.ML])
    Manifold learning (ML), known also as non-linear dimension reduction, is a set of methods to find the low dimensional structure of data. Dimension reduction for large, high dimensional data is not merely a way to reduce the data; the new representations and descriptors obtained by ML reveal the geometric shape of high dimensional point clouds, and allow one to visualize, de-noise and interpret them. This survey presents the principles underlying ML, the representative methods, as well as their statistical foundations from a practicing statistician's perspective. It describes the trade-offs, and what theory tells us about the parameter and algorithmic choices we make in order to obtain reliable conclusions.
    Toward Reinforcement Learning-based Rectilinear Macro Placement Under Human Constraints. (arXiv:2311.03383v1 [cs.LG])
    Macro placement is a critical phase in chip design, which becomes more intricate when involving general rectilinear macros and layout areas. Furthermore, macro placement that incorporates human-like constraints, such as design hierarchy and peripheral bias, has the potential to significantly reduce the amount of additional manual labor required from designers. This study proposes a methodology that leverages an approach suggested by Google's Circuit Training (G-CT) to provide a learning-based macro placer that not only supports placing rectilinear cases, but also adheres to crucial human-like design principles. Our experimental results demonstrate the effectiveness of our framework in achieving power-performance-area (PPA) metrics and in obtaining placements of high quality, comparable to those produced with human intervention. Additionally, our methodology shows potential as a generalized model to address diverse macro shapes and layout areas.
    Learned Causal Method Prediction. (arXiv:2311.03989v1 [cs.LG])
    For a given causal question, it is important to efficiently decide which causal inference method to use for a given dataset. This is challenging because causal methods typically rely on complex and difficult-to-verify assumptions, and cross-validation is not applicable since ground truth causal quantities are unobserved.In this work, we propose CAusal Method Predictor (CAMP), a framework for predicting the best method for a given dataset. To this end, we generate datasets from a diverse set of synthetic causal models, score the candidate methods, and train a model to directly predict the highest-scoring method for that dataset. Next, by formulating a self-supervised pre-training objective centered on dataset assumptions relevant for causal inference, we significantly reduce the need for costly labeled data and enhance training efficiency. Our strategy learns to map implicit dataset properties to the best method in a data-driven manner. In our experiments, we focus on method prediction for causal discovery. CAMP outperforms selecting any individual candidate method and demonstrates promising generalization to unseen semi-synthetic and real-world benchmarks.
    Image Amodal Completion: A Survey. (arXiv:2207.02062v3 [cs.CV] UPDATED)
    Existing computer vision systems can compete with humans in understanding the visible parts of objects, but still fall far short of humans when it comes to depicting the invisible parts of partially occluded objects. Image amodal completion aims to equip computers with human-like amodal completion functions to understand an intact object despite it being partially occluded. The main purpose of this survey is to provide an intuitive understanding of the research hotspots, key technologies and future trends in the field of image amodal completion. Firstly, we present a comprehensive review of the latest literature in this emerging field, exploring three key tasks in image amodal completion, including amodal shape completion, amodal appearance completion, and order perception. Then we examine popular datasets related to image amodal completion along with their common data collection methods and evaluation metrics. Finally, we discuss real-world applications and future research directions for image amodal completion, facilitating the reader's understanding of the challenges of existing technologies and upcoming research trends.
    Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data. (arXiv:2311.00136v2 [q-bio.NC] UPDATED)
    State-of-the-art systems neuroscience experiments yield large-scale multimodal data, and these data sets require new tools for analysis. Inspired by the success of large pretrained models in vision and language domains, we reframe the analysis of large-scale, cellular-resolution neuronal spiking data into an autoregressive spatiotemporal generation problem. Neuroformer is a multimodal, multitask generative pretrained transformer (GPT) model that is specifically designed to handle the intricacies of data in systems neuroscience. It scales linearly with feature size, can process an arbitrary number of modalities, and is adaptable to downstream tasks, such as predicting behavior. We first trained Neuroformer on simulated datasets, and found that it both accurately predicted simulated neuronal circuit activity, and also intrinsically inferred the underlying neural circuit connectivity, including direction. When pretrained to decode neural responses, the model predicted the behavior of a mouse with only few-shot fine-tuning, suggesting that the model begins learning how to do so directly from the neural representations themselves, without any explicit supervision. We used an ablation study to show that joint training on neuronal responses and behavior boosted performance, highlighting the model's ability to associate behavioral and neural representations in an unsupervised manner. These findings show that Neuroformer can analyze neural datasets and their emergent properties, informing the development of models and hypotheses associated with the brain.
    Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models. (arXiv:2311.03687v1 [cs.PF])
    Large Language Models (LLMs) have seen great advance in both academia and industry, and their popularity results in numerous open-source frameworks and techniques in accelerating LLM pre-training, fine-tuning, and inference. Training and deploying LLMs are expensive as it requires considerable computing resources and memory, hence many efficient approaches have been developed for improving system pipelines as well as operators. However, the runtime performance can vary significantly across hardware and software stacks, which makes it difficult to choose the best configuration. In this work, we aim to benchmark the performance from both macro and micro perspectives. First, we benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs in different sizes , i.e., 7, 13, and 70 billion parameters (7B, 13B, and 70B) on three 8-GPU platforms with and without individual optimization techniques, including ZeRO, quantization, recomputation, FlashAttention. Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs. For end users, our benchmark and findings help better understand different optimization techniques, training and inference frameworks, together with hardware platforms in choosing configurations for deploying LLMs. For researchers, our in-depth module-wise analyses discover potential opportunities for future work to further optimize the runtime performance of LLMs.
    Mitigating Estimation Errors by Twin TD-Regularized Actor and Critic for Deep Reinforcement Learning. (arXiv:2311.03711v1 [cs.LG])
    We address the issue of estimation bias in deep reinforcement learning (DRL) by introducing solution mechanisms that include a new, twin TD-regularized actor-critic (TDR) method. It aims at reducing both over and under-estimation errors. With TDR and by combining good DRL improvements, such as distributional learning and long N-step surrogate stage reward (LNSS) method, we show that our new TDR-based actor-critic learning has enabled DRL methods to outperform their respective baselines in challenging environments in DeepMind Control Suite. Furthermore, they elevate TD3 and SAC respectively to a level of performance comparable to that of D4PG (the current SOTA), and they also improve the performance of D4PG to a new SOTA level measured by mean reward, convergence speed, learning success rate, and learning variance.
    Convergence of Adam Under Relaxed Assumptions. (arXiv:2304.13972v3 [math.OC] UPDATED)
    In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimate (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of the Adam algorithm in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show the convergence to stationary points. In this paper, we show that Adam provably converges to $\epsilon$-stationary points with ${O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of ${O}(\epsilon^{-3})$.  ( 2 min )
    Neural MMO 2.0: A Massively Multi-task Addition to Massively Multi-agent Learning. (arXiv:2311.03736v1 [cs.AI])
    Neural MMO 2.0 is a massively multi-agent environment for reinforcement learning research. The key feature of this new version is a flexible task system that allows users to define a broad range of objectives and reward signals. We challenge researchers to train agents capable of generalizing to tasks, maps, and opponents never seen during training. Neural MMO features procedurally generated maps with 128 agents in the standard setting and support for up to. Version 2.0 is a complete rewrite of its predecessor with three-fold improved performance and compatibility with CleanRL. We release the platform as free and open-source software with comprehensive documentation available at neuralmmo.github.io and an active community Discord. To spark initial research on this new platform, we are concurrently running a competition at NeurIPS 2023.
    Guaranteed Conformance of Neurosymbolic Models to Natural Constraints. (arXiv:2212.01346v8 [cs.LG] UPDATED)
    Deep neural networks have emerged as the workhorse for a large section of robotics and control applications, especially as models for dynamical systems. Such data-driven models are in turn used for designing and verifying autonomous systems. They are particularly useful in modeling medical systems where data can be leveraged to individualize treatment. In safety-critical applications, it is important that the data-driven model is conformant to established knowledge from the natural sciences. Such knowledge is often available or can often be distilled into a (possibly black-box) model. For instance, an F1 racing car should conform to Newton's laws (which are encoded within a unicycle model). In this light, we consider the following problem - given a model $M$ and a state transition dataset, we wish to best approximate the system model while being a bounded distance away from $M$. We propose a method to guarantee this conformance. Our first step is to distill the dataset into a few representative samples called memories, using the idea of a growing neural gas. Next, using these memories we partition the state space into disjoint subsets and compute bounds that should be respected by the neural network in each subset. This serves as a symbolic wrapper for guaranteed conformance. We argue theoretically that this only leads to a bounded increase in approximation error; which can be controlled by increasing the number of memories. We experimentally show that on three case studies (Car Model, Drones, and Artificial Pancreas), our constrained neurosymbolic models conform to specified models (each encoding various constraints) with order-of-magnitude improvements compared to the augmented Lagrangian and vanilla training methods. Our code can be found at: https://github.com/kaustubhsridhar/Constrained_Models
    Illumination Variation Correction Using Image Synthesis For Unsupervised Domain Adaptive Person Re-Identification. (arXiv:2301.09702v3 [eess.IV] UPDATED)
    Unsupervised domain adaptive (UDA) person re-identification (re-ID) aims to learn identity information from labeled images in source domains and apply it to unlabeled images in a target domain. One major issue with many unsupervised re-identification methods is that they do not perform well relative to large domain variations such as illumination, viewpoint, and occlusions. In this paper, we propose a Synthesis Model Bank (SMB) to deal with illumination variation in unsupervised person re-ID. The proposed SMB consists of several convolutional neural networks (CNN) for feature extraction and Mahalanobis matrices for distance metrics. They are trained using synthetic data with different illumination conditions such that their synergistic effect makes the SMB robust against illumination variation. To better quantify the illumination intensity and improve the quality of synthetic images, we introduce a new 3D virtual-human dataset for GAN-based image synthesis. From our experiments, the proposed SMB outperforms other synthesis methods on several re-ID benchmarks.
    Measuring Adversarial Datasets. (arXiv:2311.03566v1 [cs.LG])
    In the era of widespread public use of AI systems across various domains, ensuring adversarial robustness has become increasingly vital to maintain safety and prevent undesirable errors. Researchers have curated various adversarial datasets (through perturbations) for capturing model deficiencies that cannot be revealed in standard benchmark datasets. However, little is known about how these adversarial examples differ from the original data points, and there is still no methodology to measure the intended and unintended consequences of those adversarial transformations. In this research, we conducted a systematic survey of existing quantifiable metrics that describe text instances in NLP tasks, among dimensions of difficulty, diversity, and disagreement. We selected several current adversarial effect datasets and compared the distributions between the original and their adversarial counterparts. The results provide valuable insights into what makes these datasets more challenging from a metrics perspective and whether they align with underlying assumptions.
    A graph convolutional autoencoder approach to model order reduction for parametrized PDEs. (arXiv:2305.08573v2 [math.NA] UPDATED)
    The present work proposes a framework for nonlinear model order reduction based on a Graph Convolutional Autoencoder (GCA-ROM). In the reduced order modeling (ROM) context, one is interested in obtaining real-time and many-query evaluations of parametric Partial Differential Equations (PDEs). Linear techniques such as Proper Orthogonal Decomposition (POD) and Greedy algorithms have been analyzed thoroughly, but they are more suitable when dealing with linear and affine models showing a fast decay of the Kolmogorov n-width. On one hand, the autoencoder architecture represents a nonlinear generalization of the POD compression procedure, allowing one to encode the main information in a latent set of variables while extracting their main features. On the other hand, Graph Neural Networks (GNNs) constitute a natural framework for studying PDE solutions defined on unstructured meshes. Here, we develop a non-intrusive and data-driven nonlinear reduction approach, exploiting GNNs to encode the reduced manifold and enable fast evaluations of parametrized PDEs. We show the capabilities of the methodology for several models: linear/nonlinear and scalar/vector problems with fast/slow decay in the physically and geometrically parametrized setting. The main properties of our approach consist of (i) high generalizability in the low-data regime even for complex regimes, (ii) physical compliance with general unstructured grids, and (iii) exploitation of pooling and un-pooling operations to learn from scattered data.  ( 3 min )
    Curating Naturally Adversarial Datasets for Learning-Enabled Medical Cyber-Physical Systems. (arXiv:2309.00543v2 [cs.LG] UPDATED)
    Deep learning models have shown promising predictive accuracy for time-series healthcare applications. However, ensuring the robustness of these models is vital for building trustworthy AI systems. Existing research predominantly focuses on robustness to synthetic adversarial examples, crafted by adding imperceptible perturbations to clean input data. However, these synthetic adversarial examples do not accurately reflect the most challenging real-world scenarios, especially in the context of healthcare data. Consequently, robustness to synthetic adversarial examples may not necessarily translate to robustness against naturally occurring adversarial examples, which is highly desirable for trustworthy AI. We propose a method to curate datasets comprised of natural adversarial examples to evaluate model robustness. The method relies on probabilistic labels obtained from automated weakly-supervised labeling that combines noisy and cheap-to-obtain labeling heuristics. Based on these labels, our method adversarially orders the input data and uses this ordering to construct a sequence of increasingly adversarial datasets. Our evaluation on six medical case studies and three non-medical case studies demonstrates the efficacy and statistical validity of our approach to generating naturally adversarial datasets
    Learning-Based Optimal Control with Performance Guarantees for Unknown Systems with Latent States. (arXiv:2303.17963v2 [eess.SY] UPDATED)
    As control engineering methods are applied to increasingly complex systems, data-driven approaches for system identification appear as a promising alternative to physics-based modeling. While the Bayesian approaches prevalent for safety-critical applications usually rely on the availability of state measurements, the states of a complex system are often not directly measurable. It may then be necessary to jointly estimate the dynamics and the latent state, making the quantification of uncertainties and the design of controllers with formal performance guarantees considerably more challenging. This paper proposes a novel method for the computation of an optimal input trajectory for unknown nonlinear systems with latent states based on a combination of particle Markov chain Monte Carlo methods and scenario theory. Probabilistic performance guarantees are derived for the resulting input trajectory, and an approach to validate the performance of arbitrary control laws is presented. The effectiveness of the proposed method is demonstrated in a numerical simulation.
    DynGFN: Towards Bayesian Inference of Gene Regulatory Networks with GFlowNets. (arXiv:2302.04178v3 [cs.LG] UPDATED)
    One of the grand challenges of cell biology is inferring the gene regulatory network (GRN) which describes interactions between genes and their products that control gene expression and cellular function. We can treat this as a causal discovery problem but with two non-standard challenges: (1) regulatory networks are inherently cyclic so we should not model a GRN as a directed acyclic graph (DAG), and (2) observations have significant measurement noise, so for typical sample sizes there will always be a large equivalence class of graphs that are likely given the data, and we want methods that capture this uncertainty. Existing methods either focus on challenge (1), identifying cyclic structure from dynamics, or on challenge (2) learning complex Bayesian posteriors over DAGs, but not both. In this paper we leverage the fact that it is possible to estimate the "velocity" of gene expression with RNA velocity techniques to develop an approach that addresses both challenges. Because we have access to velocity information, we can treat the Bayesian structure learning problem as a problem of sparse identification of a dynamical system, capturing cyclic feedback loops through time. Since our objective is to model uncertainty over discrete structures, we leverage Generative Flow Networks (GFlowNets) to estimate the posterior distribution over the combinatorial space of possible sparse dependencies. Our results indicate that our method learns posteriors that better encapsulate the distributions of cyclic structures compared to counterpart state-of-the-art Bayesian structure learning approaches.
    DealMVC: Dual Contrastive Calibration for Multi-view Clustering. (arXiv:2308.09000v3 [cs.CV] UPDATED)
    Benefiting from the strong view-consistent information mining capacity, multi-view contrastive clustering has attracted plenty of attention in recent years. However, we observe the following drawback, which limits the clustering performance from further improvement. The existing multi-view models mainly focus on the consistency of the same samples in different views while ignoring the circumstance of similar but different samples in cross-view scenarios. To solve this problem, we propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). Specifically, we first design a fusion mechanism to obtain a global cross-view feature. Then, a global contrastive calibration loss is proposed by aligning the view feature similarity graph and the high-confidence pseudo-label graph. Moreover, to utilize the diversity of multi-view information, we propose a local contrastive calibration loss to constrain the consistency of pair-wise view features. The feature structure is regularized by reliable class information, thus guaranteeing similar samples have similar features in different views. During the training procedure, the interacted cross-view feature is jointly optimized at both local and global levels. In comparison with other state-of-the-art approaches, the comprehensive experimental results obtained from eight benchmark datasets provide substantial validation of the effectiveness and superiority of our algorithm. We release the code of DealMVC at https://github.com/xihongyang1999/DealMVC on GitHub.
    MoleCLUEs: Molecular Conformers Maximally In-Distribution for Predictive Models. (arXiv:2306.11681v2 [cs.LG] UPDATED)
    Structure-based molecular ML (SBML) models can be highly sensitive to input geometries and give predictions with large variance. We present an approach to mitigate the challenge of selecting conformations for such models by generating conformers that explicitly minimize predictive uncertainty. To achieve this, we compute estimates of aleatoric and epistemic uncertainties that are differentiable w.r.t. latent posteriors. We then iteratively sample new latents in the direction of lower uncertainty by gradient descent. As we train our predictive models jointly with a conformer decoder, the new latent embeddings can be mapped to their corresponding inputs, which we call \textit{MoleCLUEs}, or (molecular) counterfactual latent uncertainty explanations \citep{antoran2020getting}. We assess our algorithm for the task of predicting drug properties from 3D structure with maximum confidence. We additionally analyze the structure trajectories obtained from conformer optimizations, which provide insight into the sources of uncertainty in SBML.
    On efficient algorithms for computing near-best polynomial approximations to high-dimensional, Hilbert-valued functions from limited samples. (arXiv:2203.13908v2 [math.NA] UPDATED)
    Sparse polynomial approximation has become indispensable for approximating smooth, high- or infinite-dimensional functions from limited samples. This is a key task in computational science and engineering, e.g., surrogate modelling in uncertainty quantification where the function is the solution map of a parametric or stochastic differential equation (DE). Yet, sparse polynomial approximation lacks a complete theory. On the one hand, there is a well-developed theory of best $s$-term polynomial approximation, which asserts exponential or algebraic rates of convergence for holomorphic functions. On the other, there are increasingly mature methods such as (weighted) $\ell^1$-minimization for computing such approximations. While the sample complexity of these methods has been analyzed with compressed sensing, whether they achieve best $s$-term approximation rates is not fully understood. Furthermore, these methods are not algorithms per se, as they involve exact minimizers of nonlinear optimization problems. This paper closes these gaps. Specifically, we consider the following question: are there robust, efficient algorithms for computing approximations to finite- or infinite-dimensional, holomorphic and Hilbert-valued functions from limited samples that achieve best $s$-term rates? We answer this affirmatively by introducing algorithms and theoretical guarantees that assert exponential or algebraic rates of convergence, along with robustness to sampling, algorithmic, and physical discretization errors. We tackle both scalar- and Hilbert-valued functions, this being key to parametric or stochastic DEs. Our results involve significant developments of existing techniques, including a novel restarted primal-dual iteration for solving weighted $\ell^1$-minimization problems in Hilbert spaces. Our theory is supplemented by numerical experiments demonstrating the efficacy of these algorithms.
    An Initialization Schema for Neuronal Networks on Tabular Data. (arXiv:2311.03996v1 [cs.LG])
    Nowadays, many modern applications require heterogeneous tabular data, which is still a challenging task in terms of regression and classification. Many approaches have been proposed to adapt neural networks for this task, but still, boosting and bagging of decision trees are the best-performing methods for this task. In this paper, we show that a binomial initialized neural network can be used effectively on tabular data. The proposed approach shows a simple but effective approach for initializing the first hidden layer in neural networks. We also show that this initializing schema can be used to jointly train ensembles by adding gradient masking to batch entries and using the binomial initialization for the last layer in a neural network. For this purpose, we modified the hinge binary loss and the soft max loss to make them applicable for joint ensemble training. We evaluate our approach on multiple public datasets and showcase the improved performance compared to other neural network-based approaches. In addition, we discuss the limitations and possible further research of our approach for improving the applicability of neural networks to tabular data. Link: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=%2FInitializationNeuronalNetworksTabularData&mode=list
    FLORA: Fine-grained Low-Rank Architecture Search for Vision Transformer. (arXiv:2311.03912v1 [cs.CV])
    Vision Transformers (ViT) have recently demonstrated success across a myriad of computer vision tasks. However, their elevated computational demands pose significant challenges for real-world deployment. While low-rank approximation stands out as a renowned method to reduce computational loads, efficiently automating the target rank selection in ViT remains a challenge. Drawing from the notable similarity and alignment between the processes of rank selection and One-Shot NAS, we introduce FLORA, an end-to-end automatic framework based on NAS. To overcome the design challenge of supernet posed by vast search space, FLORA employs a low-rank aware candidate filtering strategy. This method adeptly identifies and eliminates underperforming candidates, effectively alleviating potential undertraining and interference among subnetworks. To further enhance the quality of low-rank supernets, we design a low-rank specific training paradigm. First, we propose weight inheritance to construct supernet and enable gradient sharing among low-rank modules. Secondly, we adopt low-rank aware sampling to strategically allocate training resources, taking into account inherited information from pre-trained models. Empirical results underscore FLORA's efficacy. With our method, a more fine-grained rank configuration can be generated automatically and yield up to 33% extra FLOPs reduction compared to a simple uniform configuration. More specific, FLORA-DeiT-B/FLORA-Swin-B can save up to 55%/42% FLOPs almost without performance degradtion. Importantly, FLORA boasts both versatility and orthogonality, offering an extra 21%-26% FLOPs reduction when integrated with leading compression techniques or compact hybrid structures. Our code is publicly available at https://github.com/shadowpa0327/FLORA.
    Its All Graph To Me: Foundational Topology Models with Contrastive Learning on Multiple Domains. (arXiv:2311.03976v1 [cs.LG])
    Representations and embeddings of graph data have been essential in many domains of research. The principle benefit of learning such representations is that the pre-trained model can be fine-tuned on smaller datasets where data or labels are scarse. Existing models, however, are domain specific; for example a model trained on molecular graphs is fine-tuned on other molecular graphs. This means that in many application cases the choice of pre-trained model can be arbitrary, and novel domains may lack an appropriate pre-trained model. This is of particular issue where data is scarse, precluding traditional supervised methods. In this work we use adversarial contrastive learning to present a \method, a model pre-trained on many graph domains. We train the model only on topologies but include node labels in evaluation. We evaluate the efficacy of its learnt representations on various downstream tasks. Against baseline models pre-trained on single domains, as well as un-trained models and non-transferred models, we show that performance is equal or better using our single model. This includes when node labels are used in evaluation, where performance is consistently superior to single-domain or non-pre-trained models.
    The NCI Imaging Data Commons as a platform for reproducible research in computational pathology. (arXiv:2303.09354v3 [cs.CV] UPDATED)
    Background and Objectives: Reproducibility is a major challenge in developing machine learning (ML)-based solutions in computational pathology (CompPath). The NCI Imaging Data Commons (IDC) provides >120 cancer image collections according to the FAIR principles and is designed to be used with cloud ML services. Here, we explore its potential to facilitate reproducibility in CompPath research. Methods: Using the IDC, we implemented two experiments in which a representative ML-based method for classifying lung tumor tissue was trained and/or evaluated on different datasets. To assess reproducibility, the experiments were run multiple times with separate but identically configured instances of common ML services. Results: The AUC values of different runs of the same experiment were generally consistent. However, we observed small variations in AUC values of up to 0.045, indicating a practical limit to reproducibility. Conclusions: We conclude that the IDC facilitates approaching the reproducibility limit of CompPath research (i) by enabling researchers to reuse exactly the same datasets and (ii) by integrating with cloud ML services so that experiments can be run in identically configured computing environments.  ( 3 min )
    Optimal Transport for Change Detection on LiDAR Point Clouds. (arXiv:2302.07025v4 [cs.CV] UPDATED)
    Unsupervised change detection between airborne LiDAR data points, taken at separate times over the same location, can be difficult due to unmatching spatial support and noise from the acquisition system. Most current approaches to detect changes in point clouds rely heavily on the computation of Digital Elevation Models (DEM) images and supervised methods. Obtaining a DEM leads to LiDAR informational loss due to pixelisation, and supervision requires large amounts of labelled data often unavailable in real-world scenarios. We propose an unsupervised approach based on the computation of the transport of 3D LiDAR points over two temporal supports. The method is based on unbalanced optimal transport and can be generalised to any change detection problem with LiDAR data. We apply our approach to publicly available datasets for monitoring urban sprawling in various noise and resolution configurations that mimic several sensors used in practice. Our method allows for unsupervised multi-class classification and outperforms the previous state-of-the-art unsupervised approaches by a significant margin.
    Climate-Invariant Machine Learning. (arXiv:2112.08440v3 [cs.LG] UPDATED)
    Projecting climate change is a generalization problem: we extrapolate the recent past using physical models across past, present, and future climates. Current climate models require representations of processes that occur at scales smaller than model grid size, which have been the main source of model projection uncertainty. Recent machine learning (ML) algorithms hold promise to improve such process representations, but tend to extrapolate poorly to climate regimes they were not trained on. To get the best of the physical and statistical worlds, we propose a new framework -- termed "climate-invariant" ML -- incorporating knowledge of climate processes into ML algorithms, and show that it can maintain high offline accuracy across a wide range of climate conditions and configurations in three distinct atmospheric models. Our results suggest that explicitly incorporating physical knowledge into data-driven models of Earth system processes can improve their consistency, data efficiency, and generalizability across climate regimes.
    Local Convergence of Gradient Methods for Min-Max Games: Partial Curvature Generically Suffices. (arXiv:2305.17275v2 [math.OC] UPDATED)
    We study the convergence to local Nash equilibria of gradient methods for two-player zero-sum differentiable games. It is well-known that such dynamics converge locally when $S \succ 0$ and may diverge when $S=0$, where $S\succeq 0$ is the symmetric part of the Jacobian at equilibrium that accounts for the "potential" component of the game. We show that these dynamics also converge as soon as $S$ is nonzero (partial curvature) and the eigenvectors of the antisymmetric part $A$ are in general position with respect to the kernel of $S$. We then study the convergence rates when $S \ll A$ and prove that they typically depend on the average of the eigenvalues of $S$, instead of the minimum as an analogy with minimization problems would suggest. To illustrate our results, we consider the problem of computing mixed Nash equilibria of continuous games. We show that, thanks to partial curvature, conic particle methods -- which optimize over both weights and supports of the mixed strategies -- generically converge faster than fixed-support methods. For min-max games, it is thus beneficial to add degrees of freedom "with curvature": this can be interpreted as yet another benefit of over-parameterization.
    Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks. (arXiv:2306.04186v2 [eess.AS] UPDATED)
    Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, generally including clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level downstream tasks. In order to tackle both clip-level and frame-level tasks, this paper proposes Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We have carefully designed the view creation strategy for ATST-Clip and ATST-Frame. Specifically, ATST-Clip uses segment-wise data augmentations, and ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performances on most of the clip-level and frame-level downstream tasks. Especially, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, the performance can be further improved by combining the two models through knowledge distillation. Our code is available online.
    PcLast: Discovering Plannable Continuous Latent States. (arXiv:2311.03534v1 [cs.LG])
    Goal-conditioned planning benefits from learned low-dimensional representations of rich, high-dimensional observations. While compact latent representations, typically learned from variational autoencoders or inverse dynamics, enable goal-conditioned planning they ignore state affordances, thus hampering their sample-efficient planning capabilities. In this paper, we learn a representation that associates reachable states together for effective onward planning. We first learn a latent representation with multi-step inverse dynamics (to remove distracting information); and then transform this representation to associate reachable states together in $\ell_2$ space. Our proposals are rigorously tested in various simulation testbeds. Numerical results in reward-based and reward-free settings show significant improvements in sampling efficiency, and yields layered state abstractions that enable computationally efficient hierarchical planning.
    Improved particle-flow event reconstruction with scalable neural networks for current and future particle detectors. (arXiv:2309.06782v3 [physics.data-an] UPDATED)
    We study scalable machine learning models for full event reconstruction in high-energy electron-positron collisions based on a highly granular detector simulation. Particle-flow reconstruction can be formulated as a supervised learning task using tracks and calorimeter clusters or hits. We compare a graph neural network and kernel-based transformer and demonstrate that both avoid quadratic memory allocation and computational cost while achieving realistic reconstruction. We show that hyperparameter tuning on a supercomputer significantly enhances the physics performance of the models, improving the jet transverse momentum resolution by up to 50% compared to the baseline. The resulting model is highly portable across hardware processors. Finally, we demonstrate that the model can be trained on highly granular inputs consisting of tracks and calorimeter hits, resulting in a competitive physics performance with the baseline. Datasets and software to reproduce the studies are published following the findable, accessible, interoperable, and reusable principles.
    Amodal Intra-class Instance Segmentation: Synthetic Datasets and Benchmark. (arXiv:2303.06596v2 [cs.CV] UPDATED)
    Images of realistic scenes often contain intra-class objects that are heavily occluded from each other, making the amodal perception task that requires parsing the occluded parts of the objects challenging. Although important for downstream tasks such as robotic grasping systems, the lack of large-scale amodal datasets with detailed annotations makes it difficult to model intra-class occlusions explicitly. This paper introduces two new amodal datasets for image amodal completion tasks, which contain a total of over 267K images of intra-class occlusion scenarios, annotated with multiple masks, amodal bounding boxes, dual order relations and full appearance for instances and background. We also present a point-supervised scheme with layer priors for amodal instance segmentation specifically designed for intra-class occlusion scenarios. Experiments show that our weakly supervised approach outperforms the SOTA fully supervised methods, while our layer priors design exhibits remarkable performance improvements in the case of intra-class occlusion in both synthetic and real images.  ( 2 min )
    AdaSub: Stochastic Optimization Using Second-Order Information in Low-Dimensional Subspaces. (arXiv:2310.20060v2 [math.OC] UPDATED)
    We introduce AdaSub, a stochastic optimization algorithm that computes a search direction based on second-order information in a low-dimensional subspace that is defined adaptively based on available current and past information. Compared to first-order methods, second-order methods exhibit better convergence characteristics, but the need to compute the Hessian matrix at each iteration results in excessive computational expenses, making them impractical. To address this issue, our approach enables the management of computational expenses and algorithm efficiency by enabling the selection of the subspace dimension for the search. Our code is freely available on GitHub, and our preliminary numerical results demonstrate that AdaSub surpasses popular stochastic optimizers in terms of time and number of iterations required to reach a given accuracy.
    FedMEKT: Distillation-based Embedding Knowledge Transfer for Multimodal Federated Learning. (arXiv:2307.13214v2 [cs.LG] UPDATED)
    Federated learning (FL) enables a decentralized machine learning paradigm for multiple clients to collaboratively train a generalized global model without sharing their private data. Most existing works simply propose typical FL systems for single-modal data, thus limiting its potential on exploiting valuable multimodal data for future personalized applications. Furthermore, the majority of FL approaches still rely on the labeled data at the client side, which is limited in real-world applications due to the inability of self-annotation from users. In light of these limitations, we propose a novel multimodal FL framework that employs a semi-supervised learning approach to leverage the representations from different modalities. Bringing this concept into a system, we develop a distillation-based multimodal embedding knowledge transfer mechanism, namely FedMEKT, which allows the server and clients to exchange the joint knowledge of their learning models extracted from a small multimodal proxy dataset. Our FedMEKT iteratively updates the generalized global encoders with the joint embedding knowledge from the participating clients. Thereby, to address the modality discrepancy and labeled data constraint in existing FL systems, our proposed FedMEKT comprises local multimodal autoencoder learning, generalized multimodal autoencoder construction, and generalized classifier learning. Through extensive experiments on three multimodal human activity recognition datasets, we demonstrate that FedMEKT achieves superior global encoder performance on linear evaluation and guarantees user privacy for personal data and model parameters while demanding less communication cost than other baselines.
    Comparing Causal Frameworks: Potential Outcomes, Structural Models, Graphs, and Abstractions. (arXiv:2306.14351v2 [stat.ME] UPDATED)
    The aim of this paper is to make clear and precise the relationship between the Rubin causal model (RCM) and structural causal model (SCM) frameworks for causal inference. Adopting a neutral logical perspective, and drawing on previous work, we show what is required for an RCM to be representable by an SCM. A key result then shows that every RCM -- including those that violate algebraic principles implied by the SCM framework -- emerges as an abstraction of some representable RCM. Finally, we illustrate the power of this conciliatory perspective by pinpointing an important role for SCM principles in classic applications of RCMs; conversely, we offer a characterization of the algebraic constraints implied by a graph, helping to substantiate further comparisons between the two frameworks.
    A Corrected Expected Improvement Acquisition Function Under Noisy Observations. (arXiv:2310.05166v2 [cs.LG] UPDATED)
    Sequential maximization of expected improvement (EI) is one of the most widely used policies in Bayesian optimization because of its simplicity and ability to handle noisy observations. In particular, the improvement function often uses the best posterior mean as the best incumbent in noisy settings. However, the uncertainty associated with the incumbent solution is often neglected in many analytic EI-type methods: a closed-form acquisition function is derived in the noise-free setting, but then applied to the setting with noisy observations. To address this limitation, we propose a modification of EI that corrects its closed-form expression by incorporating the covariance information provided by the Gaussian Process (GP) model. This acquisition function specializes to the classical noise-free result, and we argue should replace that formula in Bayesian optimization software packages, tutorials, and textbooks. This enhanced acquisition provides good generality for noisy and noiseless settings. We show that our method achieves a sublinear convergence rate on the cumulative regret bound under heteroscedastic observation noise. Our empirical results demonstrate that our proposed acquisition function can outperform EI in the presence of noisy observations on benchmark functions for black-box optimization, as well as on parameter search for neural network model compression.
    A Graph-Theoretic Framework for Understanding Open-World Semi-Supervised Learning. (arXiv:2311.03524v1 [cs.LG])
    Open-world semi-supervised learning aims at inferring both known and novel classes in unlabeled data, by harnessing prior knowledge from a labeled set with known classes. Despite its importance, there is a lack of theoretical foundations for this problem. This paper bridges the gap by formalizing a graph-theoretic framework tailored for the open-world setting, where the clustering can be theoretically characterized by graph factorization. Our graph-theoretic framework illuminates practical algorithms and provides guarantees. In particular, based on our graph formulation, we apply the algorithm called Spectral Open-world Representation Learning (SORL), and show that minimizing our loss is equivalent to performing spectral decomposition on the graph. Such equivalence allows us to derive a provable error bound on the clustering performance for both known and novel classes, and analyze rigorously when labeled data helps. Empirically, SORL can match or outperform several strong baselines on common benchmark datasets, which is appealing for practical usage while enjoying theoretical guarantees.
    Simple and Controllable Music Generation. (arXiv:2306.05284v2 [cs.SD] UPDATED)
    We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft
    SE(3) Equivariant Augmented Coupling Flows. (arXiv:2308.10364v3 [cs.LG] UPDATED)
    Coupling normalizing flows allow for fast sampling and density evaluation, making them the tool of choice for probabilistic modeling of physical systems. However, the standard coupling architecture precludes endowing flows that operate on the Cartesian coordinates of atoms with the SE(3) and permutation invariances of physical systems. This work proposes a coupling flow that preserves SE(3) and permutation equivariance by performing coordinate splits along additional augmented dimensions. At each layer, the flow maps atoms' positions into learned SE(3) invariant bases, where we apply standard flow transformations, such as monotonic rational-quadratic splines, before returning to the original basis. Crucially, our flow preserves fast sampling and density evaluation, and may be used to produce unbiased estimates of expectations with respect to the target distribution via importance sampling. When trained on the DW4, LJ13, and QM9-positional datasets, our flow is competitive with equivariant continuous normalizing flows, while allowing sampling more than an order of magnitude faster. Moreover, to the best of our knowledge, we are the first to learn the full Boltzmann distribution of alanine dipeptide by only modeling the Cartesian positions of its atoms. Lastly, we demonstrate that our flow can be trained to approximately sample from the Boltzmann distribution of the DW4 and LJ13 particle systems using only their energy functions.
    Hebbian learning inspired estimation of the linear regression parameters from queries. (arXiv:2311.03483v1 [math.ST])
    Local learning rules in biological neural networks (BNNs) are commonly referred to as Hebbian learning. [26] links a biologically motivated Hebbian learning rule to a specific zeroth-order optimization method. In this work, we study a variation of this Hebbian learning rule to recover the regression vector in the linear regression model. Zeroth-order optimization methods are known to converge with suboptimal rate for large parameter dimension compared to first-order methods like gradient descent, and are therefore thought to be in general inferior. By establishing upper and lower bounds, we show, however, that such methods achieve near-optimal rates if only queries of the linear regression loss are available. Moreover, we prove that this Hebbian learning rule can achieve considerably faster rates than any non-adaptive method that selects the queries independently of the data.
    Spatio-Temporal Similarity Measure based Multi-Task Learning for Predicting Alzheimer's Disease Progression using MRI Data. (arXiv:2311.03557v1 [cs.LG])
    Identifying and utilising various biomarkers for tracking Alzheimer's disease (AD) progression have received many recent attentions and enable helping clinicians make the prompt decisions. Traditional progression models focus on extracting morphological biomarkers in regions of interest (ROIs) from MRI/PET images, such as regional average cortical thickness and regional volume. They are effective but ignore the relationships between brain ROIs over time, which would lead to synergistic deterioration. For exploring the synergistic deteriorating relationship between these biomarkers, in this paper, we propose a novel spatio-temporal similarity measure based multi-task learning approach for effectively predicting AD progression and sensitively capturing the critical relationships between biomarkers. Specifically, we firstly define a temporal measure for estimating the magnitude and velocity of biomarker change over time, which indicate a changing trend(temporal). Converting this trend into the vector, we then compare this variability between biomarkers in a unified vector space(spatial). The experimental results show that compared with directly ROI based learning, our proposed method is more effective in predicting disease progression. Our method also enables performing longitudinal stability selection to identify the changing relationships between biomarkers, which play a key role in disease progression. We prove that the synergistic deteriorating biomarkers between cortical volumes or surface areas have a significant effect on the cognitive prediction.
    Counterfactual Data Augmentation with Contrastive Learning. (arXiv:2311.03630v1 [cs.LG])
    Statistical disparity between distinct treatment groups is one of the most significant challenges for estimating Conditional Average Treatment Effects (CATE). To address this, we introduce a model-agnostic data augmentation method that imputes the counterfactual outcomes for a selected subset of individuals. Specifically, we utilize contrastive learning to learn a representation space and a similarity measure such that in the learned representation space close individuals identified by the learned similarity measure have similar potential outcomes. This property ensures reliable imputation of counterfactual outcomes for the individuals with close neighbors from the alternative treatment group. By augmenting the original dataset with these reliable imputations, we can effectively reduce the discrepancy between different treatment groups, while inducing minimal imputation error. The augmented dataset is subsequently employed to train CATE estimation models. Theoretical analysis and experimental studies on synthetic and semi-synthetic benchmarks demonstrate that our method achieves significant improvements in both performance and robustness to overfitting across state-of-the-art models.
    Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL. (arXiv:2310.04411v2 [cs.LG] UPDATED)
    The divergence of the Q-value estimation has been a prominent issue in offline RL, where the agent has no access to real dynamics. Traditional beliefs attribute this instability to querying out-of-distribution actions when bootstrapping value targets. Though this issue can be alleviated with policy constraints or conservative Q estimation, a theoretical understanding of the underlying mechanism causing the divergence has been absent. In this work, we aim to thoroughly comprehend this mechanism and attain an improved solution. We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. Then, we propose a novel Self-Excite Eigenvalue Measure (SEEM) metric based on Neural Tangent Kernel (NTK) to measure the evolving property of Q-network at training, which provides an intriguing explanation of the emergence of divergence. For the first time, our theory can reliably decide whether the training will diverge at an early stage, and even predict the order of the growth for the estimated Q-value, the model's norm, and the crashing step when an SGD optimizer is used. The experiments demonstrate perfect alignment with this theoretic analysis. Building on our insights, we propose to resolve divergence from a novel perspective, namely improving the model's architecture for better extrapolating behavior. Through extensive empirical studies, we identify LayerNorm as a good solution to effectively avoid divergence without introducing detrimental bias, leading to superior performance. Experimental results prove that it can still work in some most challenging settings, i.e. using only 1 transitions of the dataset, where all previous methods fail. Moreover, it can be easily plugged into modern offline RL methods and achieve SOTA results on many challenging tasks. We also give unique insights into its effectiveness.  ( 3 min )
    Differentially Private Pre-Trained Model Fusion using Decentralized Federated Graph Matching. (arXiv:2311.03396v1 [cs.LG])
    Model fusion is becoming a crucial component in the context of model-as-a-service scenarios, enabling the delivery of high-quality model services to local users. However, this approach introduces privacy risks and imposes certain limitations on its applications. Ensuring secure model exchange and knowledge fusion among users becomes a significant challenge in this setting. To tackle this issue, we propose PrivFusion, a novel architecture that preserves privacy while facilitating model fusion under the constraints of local differential privacy. PrivFusion leverages a graph-based structure, enabling the fusion of models from multiple parties without necessitating retraining. By employing randomized mechanisms, PrivFusion ensures privacy guarantees throughout the fusion process. To enhance model privacy, our approach incorporates a hybrid local differentially private mechanism and decentralized federated graph matching, effectively protecting both activation values and weights. Additionally, we introduce a perturbation filter adapter to alleviate the impact of randomized noise, thereby preserving the utility of the fused model. Through extensive experiments conducted on diverse image datasets and real-world healthcare applications, we provide empirical evidence showcasing the effectiveness of PrivFusion in maintaining model performance while preserving privacy. Our contributions offer valuable insights and practical solutions for secure and collaborative data analysis within the domain of privacy-preserving model fusion.
    Formulating Discrete Probability Flow Through Optimal Transport. (arXiv:2311.03886v1 [cs.LG])
    Continuous diffusion models are commonly acknowledged to display a deterministic probability flow, whereas discrete diffusion models do not. In this paper, we aim to establish the fundamental theory for the probability flow of discrete diffusion models. Specifically, we first prove that the continuous probability flow is the Monge optimal transport map under certain conditions, and also present an equivalent evidence for discrete cases. In view of these findings, we are then able to define the discrete probability flow in line with the principles of optimal transport. Finally, drawing upon our newly established definitions, we propose a novel sampling method that surpasses previous discrete diffusion models in its ability to generate more certain outcomes. Extensive experiments on the synthetic toy dataset and the CIFAR-10 dataset have validated the effectiveness of our proposed discrete probability flow. Code is released at: https://github.com/PangzeCheung/Discrete-Probability-Flow.
    Learning Decentralized Traffic Signal Controllers with Multi-Agent Graph Reinforcement Learning. (arXiv:2311.03756v1 [cs.LG])
    This paper considers optimal traffic signal control in smart cities, which has been taken as a complex networked system control problem. Given the interacting dynamics among traffic lights and road networks, attaining controller adaptivity and scalability stands out as a primary challenge. Capturing the spatial-temporal correlation among traffic lights under the framework of Multi-Agent Reinforcement Learning (MARL) is a promising solution. Nevertheless, existing MARL algorithms ignore effective information aggregation which is fundamental for improving the learning capacity of decentralized agents. In this paper, we design a new decentralized control architecture with improved environmental observability to capture the spatial-temporal correlation. Specifically, we first develop a topology-aware information aggregation strategy to extract correlation-related information from unstructured data gathered in the road network. Particularly, we transfer the road network topology into a graph shift operator by forming a diffusion process on the topology, which subsequently facilitates the construction of graph signals. A diffusion convolution module is developed, forming a new MARL algorithm, which endows agents with the capabilities of graph learning. Extensive experiments based on both synthetic and real-world datasets verify that our proposal outperforms existing decentralized algorithms.
    Latent Diffusion for Language Generation. (arXiv:2212.09462v2 [cs.CL] UPDATED)
    Diffusion models have achieved great success in modeling continuous data modalities such as images, audio, and video, but have seen limited use in discrete domains such as language. Recent attempts to adapt diffusion to language have presented diffusion as an alternative to existing pretrained language models. We view diffusion and existing language models as complementary. We demonstrate that encoder-decoder language models can be utilized to efficiently learn high-quality language autoencoders. We then demonstrate that continuous diffusion models can be learned in the latent space of the language autoencoder, enabling us to sample continuous latent representations that can be decoded into natural language with the pretrained decoder. We validate the effectiveness of our approach for unconditional, class-conditional, and sequence-to-sequence language generation. We demonstrate across multiple diverse data sets that our latent language diffusion models are significantly more effective than previous diffusion language models.
    Unscrambling the Rectification of Adversarial Attacks Transferability across Computer Networks. (arXiv:2311.03373v1 [cs.CR])
    Convolutional neural networks (CNNs) models play a vital role in achieving state-of-the-art performances in various technological fields. CNNs are not limited to Natural Language Processing (NLP) or Computer Vision (CV) but also have substantial applications in other technological domains, particularly in cybersecurity. The reliability of CNN's models can be compromised because of their susceptibility to adversarial attacks, which can be generated effortlessly, easily applied, and transferred in real-world scenarios. In this paper, we present a novel and comprehensive method to improve the strength of attacks and assess the transferability of adversarial examples in CNNs when such strength changes, as well as whether the transferability property issue exists in computer network applications. In the context of our study, we initially examined six distinct modes of attack: the Carlini and Wagner (C&W), Fast Gradient Sign Method (FGSM), Iterative Fast Gradient Sign Method (I-FGSM), Jacobian-based Saliency Map (JSMA), Limited-memory Broyden fletcher Goldfarb Shanno (L-BFGS), and Projected Gradient Descent (PGD) attack. We applied these attack techniques on two popular datasets: the CIC and UNSW datasets. The outcomes of our experiment demonstrate that an improvement in transferability occurs in the targeted scenarios for FGSM, JSMA, LBFGS, and other attacks. Our findings further indicate that the threats to security posed by adversarial examples, even in computer network applications, necessitate the development of novel defense mechanisms to enhance the security of DL-based techniques.
    PowerFlowNet: Leveraging Message Passing GNNs for Improved Power Flow Approximation. (arXiv:2311.03415v1 [cs.LG])
    Accurate and efficient power flow (PF) analysis is crucial in modern electrical networks' efficient operation and planning. Therefore, there is a need for scalable algorithms capable of handling large-scale power networks that can provide accurate and fast solutions. Graph Neural Networks (GNNs) have emerged as a promising approach for enhancing the speed of PF approximations by leveraging their ability to capture distinctive features from the underlying power network graph. In this study, we introduce PowerFlowNet, a novel GNN architecture for PF approximation that showcases similar performance with the traditional Newton-Raphson method but achieves it 4 times faster in the simple IEEE 14-bus system and 145 times faster in the realistic case of the French high voltage network (6470rte). Meanwhile, it significantly outperforms other traditional approximation methods, such as the DC relaxation method, in terms of performance and execution time; therefore, making PowerFlowNet a highly promising solution for real-world PF analysis. Furthermore, we verify the efficacy of our approach by conducting an in-depth experimental evaluation, thoroughly examining the performance, scalability, interpretability, and architectural dependability of PowerFlowNet. The evaluation provides insights into the behavior and potential applications of GNNs in power system analysis.
    Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search. (arXiv:2311.03583v1 [cs.AI])
    This work studies a central extremal graph theory problem inspired by a 1975 conjecture of Erd\H{o}s, which aims to find graphs with a given size (number of nodes) that maximize the number of edges without having 3- or 4-cycles. We formulate this problem as a sequential decision-making problem and compare AlphaZero, a neural network-guided tree search, with tabu search, a heuristic local search method. Using either method, by introducing a curriculum -- jump-starting the search for larger graphs using good graphs found at smaller sizes -- we improve the state-of-the-art lower bounds for several sizes. We also propose a flexible graph-generation environment and a permutation-invariant network architecture for learning to search in the space of graphs.
    Stable Modular Control via Contraction Theory for Reinforcement Learning. (arXiv:2311.03669v1 [cs.LG])
    We propose a novel way to integrate control techniques with reinforcement learning (RL) for stability, robustness, and generalization: leveraging contraction theory to realize modularity in neural control, which ensures that combining stable subsystems can automatically preserve the stability. We realize such modularity via signal composition and dynamic decomposition. Signal composition creates the latent space, within which RL applies to maximizing rewards. Dynamic decomposition is realized by coordinate transformation that creates an auxiliary space, within which the latent signals are coupled in the way that their combination can preserve stability provided each signal, that is, each subsystem, has stable self-feedbacks. Leveraging modularity, the nonlinear stability problem is deconstructed into algebraically solvable ones, the stability of the subsystems in the auxiliary space, yielding linear constraints on the input gradients of control networks that can be as simple as switching the signs of network weights. This minimally invasive method for stability allows arguably easy integration into the modular neural architectures in machine learning, like hierarchical RL, and improves their performance. We demonstrate in simulation the necessity and the effectiveness of our method: the necessity for robustness and generalization, and the effectiveness in improving hierarchical RL for manipulation learning.
    Text Augmentations with R-drop for Classification of Tweets Self Reporting Covid-19. (arXiv:2311.03420v1 [cs.CL])
    This paper presents models created for the Social Media Mining for Health 2023 shared task. Our team addressed the first task, classifying tweets that self-report Covid-19 diagnosis. Our approach involves a classification model that incorporates diverse textual augmentations and utilizes R-drop to augment data and mitigate overfitting, boosting model efficacy. Our leading model, enhanced with R-drop and augmentations like synonym substitution, reserved words, and back translations, outperforms the task mean and median scores. Our system achieves an impressive F1 score of 0.877 on the test set.
    Graph Neural Networks for Power Grid Operational Risk Assessment. (arXiv:2311.03661v1 [eess.SY])
    In this article, the utility of graph neural network (GNN) surrogates for Monte Carlo (MC) sampling-based risk quantification in daily operations of power grid is investigated. The MC simulation process necessitates solving a large number of optimal power flow (OPF) problems corresponding to the sample values of stochastic grid variables (power demand and renewable generation), which is computationally prohibitive. Computationally inexpensive surrogates of the OPF problem provide an attractive alternative for expedited MC simulation. GNN surrogates are especially suitable due to their superior ability to handle graph-structured data. Therefore, GNN surrogates of OPF problem are trained using supervised learning. They are then used to obtain Monte Carlo (MC) samples of the quantities of interest (operating reserve, transmission line flow) given the (hours-ahead) probabilistic wind generation and load forecast. The utility of GNN surrogates is evaluated by comparing OPF-based and GNN-based grid reliability and risk for IEEE Case118 synthetic grid. It is shown that the GNN surrogates are sufficiently accurate for predicting the (bus-level, branch-level and system-level) grid state and enable fast as well as accurate operational risk quantification for power grids. The article thus develops various tools for fast reliability and risk quantification for real-world power grids using GNNs.
    Hopfield-Enhanced Deep Neural Networks for Artifact-Resilient Brain State Decoding. (arXiv:2311.03421v1 [q-bio.NC])
    The study of brain states, ranging from highly synchronous to asynchronous neuronal patterns like the sleep-wake cycle, is fundamental for assessing the brain's spatiotemporal dynamics and their close connection to behavior. However, the development of new techniques to accurately identify them still remains a challenge, as these are often compromised by the presence of noise, artifacts, and suboptimal recording quality. In this study, we propose a two-stage computational framework combining Hopfield Networks for artifact data preprocessing with Convolutional Neural Networks (CNNs) for classification of brain states in rat neural recordings under different levels of anesthesia. To evaluate the robustness of our framework, we deliberately introduced noise artifacts into the neural recordings. We evaluated our hybrid Hopfield-CNN pipeline by benchmarking it against two comparative models: a standalone CNN handling the same noisy inputs, and another CNN trained and tested on artifact-free data. Performance across various levels of data compression and noise intensities showed that our framework can effectively mitigate artifacts, allowing the model to reach parity with the clean-data CNN at lower noise levels. Although this study mainly benefits small-scale experiments, the findings highlight the necessity for advanced deep learning and Hopfield Network models to improve scalability and robustness in diverse real-world settings.
    Loss Dynamics of Temporal Difference Reinforcement Learning. (arXiv:2307.04841v2 [stat.ML] UPDATED)
    Reinforcement learning has been successful across several applications in which agents have to learn to act in environments with sparse feedback. However, despite this empirical success there is still a lack of theoretical understanding of how the parameters of reinforcement learning models and the features used to represent states interact to control the dynamics of learning. In this work, we use concepts from statistical physics, to study the typical case learning curves for temporal difference learning of a value function with linear function approximators. Our theory is derived under a Gaussian equivalence hypothesis where averages over the random trajectories are replaced with temporally correlated Gaussian feature averages and we validate our assumptions on small scale Markov Decision Processes. We find that the stochastic semi-gradient noise due to subsampling the space of possible episodes leads to significant plateaus in the value error, unlike in traditional gradient descent dynamics. We study how learning dynamics and plateaus depend on feature structure, learning rate, discount factor, and reward function. We then analyze how strategies like learning rate annealing and reward shaping can favorably alter learning dynamics and plateaus. To conclude, our work introduces new tools to open a new direction towards developing a theory of learning dynamics in reinforcement learning.  ( 2 min )
    Size Matters: Large Graph Generation with HiGGs. (arXiv:2306.11412v2 [cs.LG] UPDATED)
    Large graphs are present in a variety of domains, including social networks, civil infrastructure, and the physical sciences to name a few. Graph generation is similarly widespread, with applications in drug discovery, network analysis and synthetic datasets among others. While GNN (Graph Neural Network) models have been applied in these domains their high in-memory costs restrict them to small graphs. Conversely less costly rule-based methods struggle to reproduce complex structures. We propose HIGGS (Hierarchical Generation of Graphs) as a model-agnostic framework of producing large graphs with realistic local structures. HIGGS uses GNN models with conditional generation capabilities to sample graphs in hierarchies of resolution. As a result HIGGS has the capacity to extend the scale of generated graphs from a given GNN model by quadratic order. As a demonstration we implement HIGGS using DiGress, a recent graph-diffusion model, including a novel edge-predictive-diffusion variant edge-DiGress. We use this implementation to generate categorically attributed graphs with tens of thousands of nodes. These HIGGS generated graphs are far larger than any previously produced using GNNs. Despite this jump in scale we demonstrate that the graphs produced by HIGGS are, on the local scale, more realistic than those from the rule-based model BTER.  ( 2 min )
    k-Means Maximum Entropy Exploration. (arXiv:2205.15623v4 [cs.LG] UPDATED)
    Exploration in high-dimensional, continuous spaces with sparse rewards is an open problem in reinforcement learning. Artificial curiosity algorithms address this by creating rewards that lead to exploration. Given a reinforcement learning algorithm capable of maximizing rewards, the problem reduces to finding an optimization objective consistent with exploration. Maximum entropy exploration uses the entropy of the state visitation distribution as such an objective. However, efficiently estimating the entropy of the state visitation distribution is challenging in high-dimensional, continuous spaces. We introduce an artificial curiosity algorithm based on lower bounding an approximation to the entropy of the state visitation distribution. The bound relies on a result we prove for non-parametric density estimation in arbitrary dimensions using k-means. We show that our approach is both computationally efficient and competitive on benchmarks for exploration in high-dimensional, continuous spaces, especially on tasks where reinforcement learning algorithms are unable to find rewards.
    A Logic for Expressing Log-Precision Transformers. (arXiv:2210.02671v6 [cs.LG] UPDATED)
    One way to interpret the reasoning power of transformer-based language models is to describe the types of logical rules they can resolve over some input text. Recently, Chiang et al. (2023) showed that finite-precision transformers can be equivalently expressed in a generalization of first-order logic. However, finite-precision transformers are a weak transformer variant because, as we show, a single head can only attend to a constant number of tokens and, in particular, cannot represent uniform attention. Since attending broadly is a core capability for transformers, we ask whether a minimally more expressive model that can attend universally can also be characterized in logic. To this end, we analyze transformers whose forward pass is computed in $\log n$ precision on contexts of length $n$. We prove that any log-precision transformer can be equivalently expressed as a first-order logic sentence that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. This is the tightest known upper bound and first logical characterization of log-precision transformers.  ( 2 min )
    Random Field Augmentations for Self-Supervised Representation Learning. (arXiv:2311.03629v1 [cs.CV])
    Self-supervised representation learning is heavily dependent on data augmentations to specify the invariances encoded in representations. Previous work has shown that applying diverse data augmentations is crucial to downstream performance, but augmentation techniques remain under-explored. In this work, we propose a new family of local transformations based on Gaussian random fields to generate image augmentations for self-supervised representation learning. These transformations generalize the well-established affine and color transformations (translation, rotation, color jitter, etc.) and greatly increase the space of augmentations by allowing transformation parameter values to vary from pixel to pixel. The parameters are treated as continuous functions of spatial coordinates, and modeled as independent Gaussian random fields. Empirical results show the effectiveness of the new transformations for self-supervised representation learning. Specifically, we achieve a 1.7% top-1 accuracy improvement over baseline on ImageNet downstream classification, and a 3.6% improvement on out-of-distribution iNaturalist downstream classification. However, due to the flexibility of the new transformations, learned representations are sensitive to hyperparameters. While mild transformations improve representations, we observe that strong transformations can degrade the structure of an image, indicating that balancing the diversity and strength of augmentations is important for improving generalization of learned representations.
    The Linear Representation Hypothesis and the Geometry of Large Language Models. (arXiv:2311.03658v1 [cs.CL])
    Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.
    Procedural Image Programs for Representation Learning. (arXiv:2211.16412v2 [cs.CV] UPDATED)
    Learning image representations using synthetic data allows training neural networks without some of the concerns associated with real images, such as privacy and bias. Existing work focuses on a handful of curated generative processes which require expert knowledge to design, making it hard to scale up. To overcome this, we propose training with a large dataset of twenty-one thousand programs, each one generating a diverse set of synthetic images. These programs are short code snippets, which are easy to modify and fast to execute using OpenGL. The proposed dataset can be used for both supervised and unsupervised representation learning, and reduces the gap between pre-training with real and procedurally generated images by 38%.
    Topologically Regularized Data Embeddings. (arXiv:2301.03338v2 [cs.LG] UPDATED)
    Unsupervised representation learning methods are widely used for gaining insight into high-dimensional, unstructured, or structured data. In some cases, users may have prior topological knowledge about the data, such as a known cluster structure or the fact that the data is known to lie along a tree- or graph-structured topology. However, generic methods to ensure such structure is salient in the low-dimensional representations are lacking. This negatively impacts the interpretability of low-dimensional embeddings, and plausibly downstream learning tasks. To address this issue, we introduce topological regularization: a generic approach based on algebraic topology to incorporate topological prior knowledge into low-dimensional embeddings. We introduce a class of topological loss functions, and show that jointly optimizing an embedding loss with such a topological loss function as a regularizer yields embeddings that reflect not only local proximities but also the desired topological structure. We include a self-contained overview of the required foundational concepts in algebraic topology, and provide intuitive guidance on how to design topological loss functions for a variety of shapes, such as clusters, cycles, and bifurcations. We empirically evaluate the proposed approach on computational efficiency, robustness, and versatility in combination with linear and non-linear dimensionality reduction and graph embedding methods.  ( 2 min )
    Exploring the Optimal Choice for Generative Processes in Diffusion Models: Ordinary vs Stochastic Differential Equations. (arXiv:2306.02063v2 [cs.LG] UPDATED)
    The diffusion model has shown remarkable success in computer vision, but it remains unclear whether the ODE-based probability flow or the SDE-based diffusion model is more superior and under what circumstances. Comparing the two is challenging due to dependencies on data distributions, score training, and other numerical issues. In this paper, we study the problem mathematically for two limiting scenarios: the zero diffusion (ODE) case and the large diffusion case. We first introduce a pulse-shape error to perturb the score function and analyze error accumulation of sampling quality, followed by a thorough analysis for generalization to arbitrary error. Our findings indicate that when the perturbation occurs at the end of the generative process, the ODE model outperforms the SDE model with a large diffusion coefficient. However, when the perturbation occurs earlier, the SDE model outperforms the ODE model, and we demonstrate that the error of sample generation due to such a pulse-shape perturbation is exponentially suppressed as the diffusion term's magnitude increases to infinity. Numerical validation of this phenomenon is provided using Gaussian, Gaussian mixture, and Swiss roll distribution, as well as realistic datasets like MNIST and CIFAR-10.  ( 2 min )
    Offline Policy Evaluation and Optimization under Confounding. (arXiv:2211.16583v4 [stat.ML] UPDATED)
    Evaluating and optimizing policies in the presence of unobserved confounders is a problem of growing interest in offline reinforcement learning. Using conventional methods for offline RL in the presence of confounding can not only lead to poor decisions and poor policies, but also have disastrous effects in critical applications such as healthcare and education. We map out the landscape of offline policy evaluation for confounded MDPs, distinguishing assumptions on confounding based on whether they are memoryless and on their effect on the data-collection policies. We characterize settings where consistent value estimates are provably not achievable, and provide algorithms with guarantees to instead estimate lower bounds on the value. When consistent estimates are achievable, we provide algorithms for value estimation with sample complexity guarantees. We also present new algorithms for offline policy improvement and prove local convergence guarantees. Finally, we experimentally evaluate our algorithms on both a gridworld environment and a simulated healthcare setting of managing sepsis patients. In gridworld, our model-based method provides tighter lower bounds than existing methods, while in the sepsis simulator, our methods significantly outperform confounder-oblivious benchmarks.  ( 2 min )
    Attention-Enhanced Deep Learning for Device-Free Through-the-Wall Presence Detection Using Indoor WiFi System. (arXiv:2304.13105v2 [cs.LG] UPDATED)
    Accurate detection of human presence in indoor environments is important for various applications, such as energy management and security. In this paper, we propose a novel system for human presence detection using the channel state information (CSI) of WiFi signals. Our system named attention-enhanced deep learning for presence detection (ALPD) employs an attention mechanism to automatically select informative subcarriers from the CSI data and a bidirectional long short-term memory (LSTM) network to capture temporal dependencies in CSI. Additionally, we utilize a static feature to improve the accuracy of human presence detection in static states. We evaluate the proposed ALPD system by deploying a pair of WiFi access points (APs) for collecting CSI dataset, which is further compared with several benchmarks. The results demonstrate that our ALPD system outperforms the benchmarks in terms of accuracy, especially in the presence of interference. Moreover, bidirectional transmission data is beneficial to training improving stability and accuracy, as well as reducing the costs of data collection for training. Overall, our proposed ALPD system shows promising results for human presence detection using WiFi CSI signals.  ( 2 min )
    Doubly Robust Kernel Statistics for Testing Distributional Treatment Effects. (arXiv:2212.04922v2 [stat.ML] UPDATED)
    With the widespread application of causal inference, it is increasingly important to have tools which can test for the presence of causal effects in a diverse array of circumstances. In this vein we focus on the problem of testing for \emph{distributional} causal effects, where the treatment affects not just the mean, but also higher order moments of the distribution, as well as multidimensional or structured outcomes. We build upon a previously introduced framework, Counterfactual Mean Embeddings, for representing causal distributions within Reproducing Kernel Hilbert Spaces (RKHS) by proposing new, improved, estimators for the distributional embeddings. These improved estimators are inspired by doubly robust estimators of the causal mean, using a similar form within the kernel space. We analyse these estimators, proving they retain the doubly robust property and have improved convergence rates compared to the original estimators. This leads to new permutation based tests for distributional causal effects, using the estimators we propose as tests statistics. We experimentally and theoretically demonstrate the validity of our tests.  ( 2 min )
    Visualizing DNA reaction trajectories with deep graph embedding approaches. (arXiv:2311.03409v1 [q-bio.BM])
    Synthetic biologists and molecular programmers design novel nucleic acid reactions, with many potential applications. Good visualization tools are needed to help domain experts make sense of the complex outputs of folding pathway simulations of such reactions. Here we present ViDa, a new approach for visualizing DNA reaction folding trajectories over the energy landscape of secondary structures. We integrate a deep graph embedding model with common dimensionality reduction approaches, to map high-dimensional data onto 2D Euclidean space. We assess ViDa on two well-studied and contrasting DNA hybridization reactions. Our preliminary results suggest that ViDa's visualization successfully separates trajectories with different folding mechanisms, thereby providing useful insight to users, and is a big improvement over the current state-of-the-art in DNA kinetics visualization.
    Pipeline Parallelism for DNN Inference with Practical Performance Guarantees. (arXiv:2311.03703v1 [cs.LG])
    We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into $k$ stages and minimizing the running time of the bottleneck stage, including communication. We design practical algorithms for this NP-hard problem and show that they are nearly optimal in practice by comparing against strong lower bounds obtained via novel mixed-integer programming (MIP) formulations. We apply these algorithms and lower-bound methods to production models to achieve substantially improved approximation guarantees compared to standard combinatorial lower bounds. For example, evaluated via geometric means across production data with $k=16$ pipeline stages, our MIP formulations more than double the lower bounds, improving the approximation ratio from $2.175$ to $1.058$. This work shows that while max-throughput partitioning is theoretically hard, we have a handle on the algorithmic side of the problem in practice and much of the remaining challenge is in developing more accurate cost models to feed into the partitioning algorithms.
    Generative Diffusion Models for Lattice Field Theory. (arXiv:2311.03578v1 [hep-lat])
    This study delves into the connection between machine learning and lattice field theory by linking generative diffusion models (DMs) with stochastic quantization, from a stochastic differential equation perspective. We show that DMs can be conceptualized by reversing a stochastic process driven by the Langevin equation, which then produces samples from an initial distribution to approximate the target distribution. In a toy model, we highlight the capability of DMs to learn effective actions. Furthermore, we demonstrate its feasibility to act as a global sampler for generating configurations in the two-dimensional $\phi^4$ quantum lattice field theory.
    Training Multi-layer Neural Networks on Ising Machine. (arXiv:2311.03408v1 [cs.LG])
    As a dedicated quantum device, Ising machines could solve large-scale binary optimization problems in milliseconds. There is emerging interest in utilizing Ising machines to train feedforward neural networks due to the prosperity of generative artificial intelligence. However, existing methods can only train single-layer feedforward networks because of the complex nonlinear network topology. This paper proposes an Ising learning algorithm to train quantized neural network (QNN), by incorporating two essential techinques, namely binary representation of topological network and order reduction of loss function. As far as we know, this is the first algorithm to train multi-layer feedforward networks on Ising machines, providing an alternative to gradient-based backpropagation. Firstly, training QNN is formulated as a quadratic constrained binary optimization (QCBO) problem by representing neuron connection and activation function as equality constraints. All quantized variables are encoded by binary bits based on binary encoding protocol. Secondly, QCBO is converted to a quadratic unconstrained binary optimization (QUBO) problem, that can be efficiently solved on Ising machines. The conversion leverages both penalty function and Rosenberg order reduction, who together eliminate equality constraints and reduce high-order loss function into a quadratic one. With some assumptions, theoretical analysis shows the space complexity of our algorithm is $\mathcal{O}(H^2L + HLN\log H)$, quantifying the required number of Ising spins. Finally, the algorithm effectiveness is validated with a simulated Ising machine on MNIST dataset. After annealing 700 ms, the classification accuracy achieves 98.3%. Among 100 runs, the success probability of finding the optimal solution is 72%. Along with the increasing number of spins on Ising machine, our algorithm has the potential to train deeper neural networks.
    Causal Structure Representation Learning of Confounders in Latent Space for Recommendation. (arXiv:2311.03382v1 [cs.IR])
    Inferring user preferences from the historical feedback of users is a valuable problem in recommender systems. Conventional approaches often rely on the assumption that user preferences in the feedback data are equivalent to the real user preferences without additional noise, which simplifies the problem modeling. However, there are various confounders during user-item interactions, such as weather and even the recommendation system itself. Therefore, neglecting the influence of confounders will result in inaccurate user preferences and suboptimal performance of the model. Furthermore, the unobservability of confounders poses a challenge in further addressing the problem. To address these issues, we refine the problem and propose a more rational solution. Specifically, we consider the influence of confounders, disentangle them from user preferences in the latent space, and employ causal graphs to model their interdependencies without specific labels. By cleverly combining local and global causal graphs, we capture the user-specificity of confounders on user preferences. We theoretically demonstrate the identifiability of the obtained causal graph. Finally, we propose our model based on Variational Autoencoders, named Causal Structure representation learning of Confounders in latent space (CSC). We conducted extensive experiments on one synthetic dataset and five real-world datasets, demonstrating the superiority of our model. Furthermore, we demonstrate that the learned causal representations of confounders are controllable, potentially offering users fine-grained control over the objectives of their recommendation lists with the learned causal graphs.
    Can We Trust the Similarity Measurement in Federated Learning?. (arXiv:2311.03369v1 [cs.LG])
    Is it secure to measure the reliability of local models by similarity in federated learning (FL)? This paper delves into an unexplored security threat concerning applying similarity metrics, such as the L_2 norm, Euclidean distance, and cosine similarity, in protecting FL. We first uncover the deficiencies of similarity metrics that high-dimensional local models, including benign and poisoned models, may be evaluated to have the same similarity while being significantly different in the parameter values. We then leverage this finding to devise a novel untargeted model poisoning attack, Faker, which launches the attack by simultaneously maximizing the evaluated similarity of the poisoned local model and the difference in the parameter values. Experimental results based on seven datasets and eight defenses show that Faker outperforms the state-of-the-art benchmark attacks by 1.1-9.0X in reducing accuracy and 1.2-8.0X in saving time cost, which even holds for the case of a single malicious client with limited knowledge about the FL system. Moreover, Faker can degrade the performance of the global model by attacking only once. We also preliminarily explore extending Faker to other attacks, such as backdoor attacks and Sybil attacks. Lastly, we provide a model evaluation strategy, called the similarity of partial parameters (SPP), to defend against Faker. Given that numerous mechanisms in FL utilize similarity metrics to assess local models, this work suggests that we should be vigilant regarding the potential risks of using these metrics.
    CMIP X-MOS: Improving Climate Models with Extreme Model Output Statistics. (arXiv:2311.03370v1 [physics.ao-ph])
    Climate models are essential for assessing the impact of greenhouse gas emissions on our changing climate and the resulting increase in the frequency and severity of natural disasters. Despite the widespread acceptance of climate models produced by the Coupled Model Intercomparison Project (CMIP), they still face challenges in accurately predicting climate extremes, which pose most significant threats to both people and the environment. To address this limitation and improve predictions of natural disaster risks, we introduce Extreme Model Output Statistics (X-MOS). This approach utilizes deep regression techniques to precisely map CMIP model outputs to real measurements obtained from weather stations, which results in a more accurate analysis of the XXI climate extremes. In contrast to previous research, our study places a strong emphasis on enhancing the estimation of the tails of future climate parameter distributions. The latter supports decision-makers, enabling them to better assess climate-related risks across the globe.
    Transferability and explainability of deep learning emulators for regional climate model projections: Perspectives for future applications. (arXiv:2311.03378v1 [physics.ao-ph])
    Regional climate models (RCMs) are essential tools for simulating and studying regional climate variability and change. However, their high computational cost limits the production of comprehensive ensembles of regional climate projections covering multiple scenarios and driving Global Climate Models (GCMs) across regions. RCM emulators based on deep learning models have recently been introduced as a cost-effective and promising alternative that requires only short RCM simulations to train the models. Therefore, evaluating their transferability to different periods, scenarios, and GCMs becomes a pivotal and complex task in which the inherent biases of both GCMs and RCMs play a significant role. Here we focus on this problem by considering the two different emulation approaches proposed in the literature (PP and MOS, following the terminology introduced in this paper). In addition to standard evaluation techniques, we expand the analysis with methods from the field of eXplainable Artificial Intelligence (XAI), to assess the physical consistency of the empirical links learnt by the models. We find that both approaches are able to emulate certain climatological properties of RCMs for different periods and scenarios (soft transferability), but the consistency of the emulation functions differ between approaches. Whereas PP learns robust and physically meaningful patterns, MOS results are GCM-dependent and lack physical consistency in some cases. Both approaches face problems when transferring the emulation function to other GCMs, due to the existence of GCM-dependent biases (hard transferability). This limits their applicability to build ensembles of regional climate projections. We conclude by giving some prospects for future applications.
    Kernel-based Joint Multiple Graph Learning and Clustering of Graph Signals. (arXiv:2310.19005v2 [eess.SP] UPDATED)
    Within the context of Graph Signal Processing (GSP), Graph Learning (GL) is concerned with the inference of the graph's underlying structure from nodal observations. However, real-world data often contains diverse information, necessitating the simultaneous clustering and learning of multiple graphs. In practical applications, valuable node-specific covariates, represented as kernels, have been underutilized by existing graph signal clustering methods. In this letter, we propose a new framework, named Kernel-based joint Multiple GL and clustering of graph signals (KMGL), that leverages a multi-convex optimization approach. This allows us to integrate node-side information, construct low-pass filters, and efficiently solve the optimization problem. The experiments demonstrate that KMGL significantly enhances the robustness of GL and clustering, particularly in scenarios with high noise levels and a substantial number of clusters. These findings underscore the potential of KMGL for improving the performance of GSP methods in diverse, real-world applications.
    Birth of a Transformer: A Memory Viewpoint. (arXiv:2306.00802v2 [stat.ML] UPDATED)
    Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
    Multilingual Mathematical Autoformalization. (arXiv:2311.03755v1 [cs.CL])
    Autoformalization is the task of translating natural language materials into machine-verifiable formalisations. Progress in autoformalization research is hindered by the lack of a sizeable dataset consisting of informal-formal pairs expressing the same essence. Existing methods tend to circumvent this challenge by manually curating small corpora or using few-shot learning with large language models. But these methods suffer from data scarcity and formal language acquisition difficulty. In this work, we create $\texttt{MMA}$, a large, flexible, multilingual, and multi-domain dataset of informal-formal pairs, by using a language model to translate in the reverse direction, that is, from formal mathematical statements into corresponding informal ones. Experiments show that language models fine-tuned on $\texttt{MMA}$ produce $16-18\%$ of statements acceptable with minimal corrections on the $\texttt{miniF2F}$ and $\texttt{ProofNet}$ benchmarks, up from $0\%$ with the base model. We demonstrate that fine-tuning on multilingual formal data results in more capable autoformalization models even when deployed on monolingual tasks.
    Convergence Analysis of Mean Shift. (arXiv:2305.08463v3 [stat.ML] UPDATED)
    The mean shift (MS) algorithm seeks a mode of the kernel density estimate (KDE). This study presents a convergence guarantee of the mode estimate sequence generated by the MS algorithm and an evaluation of the convergence rate, under fairly mild conditions, with the help of the argument concerning the {\L}ojasiewicz inequality. Our findings extend existing ones covering analytic kernels and the Epanechnikov kernel. Those are significant in that they cover the biweight kernel, which is optimal among non-negative kernels in terms of the asymptotic statistical efficiency for the KDE-based mode estimation.
    Determination of droplet size from wide-angle light scattering image data using convolutional neural networks. (arXiv:2311.03387v1 [cs.CV])
    Wide-angle light scattering (WALS) offers the possibility of a highly temporally and spatially resolved measurement of droplets in spray-based methods for nanoparticle synthesis. The size of these droplets is a critical variable affecting the final properties of synthesized materials such as hetero-aggregates. However, conventional methods for determining droplet sizes from WALS image data are labor-intensive and may introduce biases, particularly when applied to complex systems like spray flame synthesis (SFS). To address these challenges, we introduce a fully automatic machine learning-based approach that employs convolutional neural networks (CNNs) in order to streamline the droplet sizing process. This CNN-based methodology offers further advantages: it requires few manual labels and can utilize transfer learning, making it a promising alternative to conventional methods, specifically with respect to efficiency. To evaluate the performance of our machine learning models, we consider WALS data from an ethanol spray flame process at various heights above the burner surface (HABs), where the models are trained and cross-validated on a large dataset comprising nearly 35000 WALS images.
    In-Context Exemplars as Clues to Retrieving from Large Associative Memory. (arXiv:2311.03498v1 [cs.CL])
    Recently, large language models (LLMs) have made remarkable progress in natural language processing. The most representative ability of LLMs is in-context learning (ICL), which enables LLMs to learn patterns from in-context exemplars without training. The performance of ICL greatly depends on the exemplars used. However, how to choose exemplars remains unclear due to the lack of understanding of how in-context learning works. In this paper, we present a novel perspective on ICL by conceptualizing it as contextual retrieval from a model of associative memory. We establish a theoretical framework of ICL based on Hopfield Networks. Based on our framework, we look into how in-context exemplars influence the performance of ICL and propose more efficient active exemplar selection. Our study sheds new light on the mechanism of ICL by connecting it to memory retrieval, with potential implications for advancing the understanding of LLMs.
    When Noisy Labels Meet Long Tail Dilemmas: A Representation Calibration Method. (arXiv:2211.10955v3 [cs.LG] UPDATED)
    Real-world large-scale datasets are both noisily labeled and class-imbalanced. The issues seriously hurt the generalization of trained models. It is hence significant to address the simultaneous incorrect labeling and class-imbalance, i.e., the problem of learning with noisy labels on long-tailed data. Previous works develop several methods for the problem. However, they always rely on strong assumptions that are invalid or hard to be checked in practice. In this paper, to handle the problem and address the limitations of prior works, we propose a representation calibration method RCAL. Specifically, RCAL works with the representations extracted by unsupervised contrastive learning. We assume that without incorrect labeling and class imbalance, the representations of instances in each class conform to a multivariate Gaussian distribution, which is much milder and easier to be checked. Based on the assumption, we recover underlying representation distributions from polluted ones resulting from mislabeled and class-imbalanced data. Additional data points are then sampled from the recovered distributions to help generalization. Moreover, during classifier training, representation learning takes advantage of representation robustness brought by contrastive learning, which further improves the classifier performance. We derive theoretical results to discuss the effectiveness of our representation calibration. Experiments on multiple benchmarks justify our claims and confirm the superiority of the proposed method.  ( 3 min )
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case. (arXiv:2208.14960v3 [stat.ME] UPDATED)
    Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
    Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery. (arXiv:2206.10540v4 [cs.LG] UPDATED)
    This paper revisits datasets and evaluation criteria for Symbolic Regression (SR), specifically focused on its potential for scientific discovery. Focused on a set of formulas used in the existing datasets based on Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design reasonably realistic sampling ranges of values so that our new SRSD datasets can be used for evaluating the potential of SRSD such as whether or not an SR method can (re)discover physical laws from such datasets. We also create another 120 datasets that contain dummy variables to examine whether SR methods can choose necessary variables only. Besides, we propose to use normalized edit distances (NED) between a predicted equation and the true equation trees for addressing a critical issue that existing SR metrics are either binary or errors between the target values and an SR model's predicted values for a given input. We conduct benchmark experiments on our new SRSD datasets using various representative SR methods. The experimental results show that we provide a more realistic performance evaluation, and our user study shows that the NED correlates with human judges significantly more than an existing SR metric.  ( 3 min )
    Ensembling Textual and Structure-Based Models for Knowledge Graph Completion. (arXiv:2311.03780v1 [cs.CL])
    We consider two popular approaches to Knowledge Graph Completion (KGC): textual models that rely on textual entity descriptions, and structure-based models that exploit the connectivity structure of the Knowledge Graph (KG). Preliminary experiments show that these approaches have complementary strengths: structure-based models perform well when the gold answer is easily reachable from the query head in the KG, while textual models exploit descriptions to give good performance even when the gold answer is not reachable. In response, we explore ensembling as a way of combining the best of both approaches. We propose a novel method for learning query-dependent ensemble weights by using the distributions of scores assigned by individual models to all candidate entities. Our ensemble baseline achieves state-of-the-art results on three standard KGC datasets, with up to 6.8 pt MRR and 8.3 pt Hits@1 gains over best individual models.  ( 2 min )
    Using Sum-Product Networks to Assess Uncertainty in Deep Active Learning. (arXiv:2206.09798v2 [cs.LG] UPDATED)
    The success of deep active learning hinges on the choice of an effective acquisition function, which ranks not yet labeled data points according to their expected informativeness. Many acquisition functions are (partly) based on the uncertainty that the current model has about the class label of a point, yet there is no generally agreed upon strategy for computing such uncertainty. This paper proposes a new and very simple approach to computing uncertainty in deep active learning with a Convolutional Neural Network (CNN). The main idea is to use the feature representation extracted by the CNN as data for training a Sum-Product Network (SPN). Since SPNs are typically used for estimating the distribution of a dataset, they are well suited to the task of estimating class probabilities that can be used directly by standard acquisition functions such as max entropy and variational ratio. The effectiveness of our method is demonstrated in an experimental study on several standard benchmark datasets for image classification, where we compare it to various state-of-the-art methods for assessing uncertainty in deep active learning.  ( 2 min )
    Probabilistic Categorical Adversarial Attack & Adversarial Training. (arXiv:2210.09364v3 [cs.LG] UPDATED)
    The existence of adversarial examples brings huge concern for people to apply Deep Neural Networks (DNNs) in safety-critical tasks. However, how to generate adversarial examples with categorical data is an important problem but lack of extensive exploration. Previously established methods leverage greedy search method, which can be very time-consuming to conduct successful attack. This also limits the development of adversarial training and potential defenses for categorical data. To tackle this problem, we propose Probabilistic Categorical Adversarial Attack (PCAA), which transfers the discrete optimization problem to a continuous problem that can be solved efficiently by Projected Gradient Descent. In our paper, we theoretically analyze its optimality and time complexity to demonstrate its significant advantage over current greedy based attacks. Moreover, based on our attack, we propose an efficient adversarial training framework. Through a comprehensive empirical study, we justify the effectiveness of our proposed attack and defense algorithms.  ( 2 min )
    Accurate 3D Object Detection using Energy-Based Models. (arXiv:2012.04634v2 [cs.CV] UPDATED)
    Accurate 3D object detection (3DOD) is crucial for safe navigation of complex environments by autonomous robots. Regressing accurate 3D bounding boxes in cluttered environments based on sparse LiDAR data is however a highly challenging problem. We address this task by exploring recent advances in conditional energy-based models (EBMs) for probabilistic regression. While methods employing EBMs for regression have demonstrated impressive performance on 2D object detection in images, these techniques are not directly applicable to 3D bounding boxes. In this work, we therefore design a differentiable pooling operator for 3D bounding boxes, serving as the core module of our EBM network. We further integrate this general approach into the state-of-the-art 3D object detector SA-SSD. On the KITTI dataset, our proposed approach consistently outperforms the SA-SSD baseline across all 3DOD metrics, demonstrating the potential of EBM-based regression for highly accurate 3DOD. Code is available at https://github.com/fregu856/ebms_3dod.  ( 2 min )
    Neuro-GPT: Developing A Foundation Model for EEG. (arXiv:2311.03764v1 [cs.LG])
    To handle the scarcity and heterogeneity of electroencephalography (EEG) data in Brain-Computer Interface (BCI) tasks, and to harness the vast public data, we propose Neuro-GPT, a foundation model consisting of an EEG encoder and a GPT model. The foundation model is pre-trained on a large-scale public EEG dataset, using a self-supervised task which learns how to reconstruct the masked chunk in EEG. We then fine-tune the foundation model on a Motor Imagery Classification task where only 9 subjects are available. Experiments demonstrated that applying foundation model can significantly improve classification performance compared to the model trained from scratch, which provides evidence for the advanced generalizability of foundation model and the ability to address the challenges of data scarcity and heterogeneity.  ( 2 min )
    Asynchronous Local Computations in Distributed Bayesian Learning. (arXiv:2311.03496v1 [cs.LG])
    Due to the expanding scope of machine learning (ML) to the fields of sensor networking, cooperative robotics and many other multi-agent systems, distributed deployment of inference algorithms has received a lot of attention. These algorithms involve collaboratively learning unknown parameters from dispersed data collected by multiple agents. There are two competing aspects in such algorithms, namely, intra-agent computation and inter-agent communication. Traditionally, algorithms are designed to perform both synchronously. However, certain circumstances need frugal use of communication channels as they are either unreliable, time-consuming, or resource-expensive. In this paper, we propose gossip-based asynchronous communication to leverage fast computations and reduce communication overhead simultaneously. We analyze the effects of multiple (local) intra-agent computations by the active agents between successive inter-agent communications. For local computations, Bayesian sampling via unadjusted Langevin algorithm (ULA) MCMC is utilized. The communication is assumed to be over a connected graph (e.g., as in decentralized learning), however, the results can be extended to coordinated communication where there is a central server (e.g., federated learning). We theoretically quantify the convergence rates in the process. To demonstrate the efficacy of the proposed algorithm, we present simulations on a toy problem as well as on real world data sets to train ML models to perform classification tasks. We observe faster initial convergence and improved performance accuracy, especially in the low data range. We achieve on average 78% and over 90% classification accuracy respectively on the Gamma Telescope and mHealth data sets from the UCI ML repository.  ( 3 min )
    Context Shift Reduction for Offline Meta-Reinforcement Learning. (arXiv:2311.03695v1 [cs.LG])
    Offline meta-reinforcement learning (OMRL) utilizes pre-collected offline datasets to enhance the agent's generalization ability on unseen tasks. However, the context shift problem arises due to the distribution discrepancy between the contexts used for training (from the behavior policy) and testing (from the exploration policy). The context shift problem leads to incorrect task inference and further deteriorates the generalization ability of the meta-policy. Existing OMRL methods either overlook this problem or attempt to mitigate it with additional information. In this paper, we propose a novel approach called Context Shift Reduction for OMRL (CSRO) to address the context shift problem with only offline datasets. The key insight of CSRO is to minimize the influence of policy in context during both the meta-training and meta-test phases. During meta-training, we design a max-min mutual information representation learning mechanism to diminish the impact of the behavior policy on task representation. In the meta-test phase, we introduce the non-prior context collection strategy to reduce the effect of the exploration policy. Experimental results demonstrate that CSRO significantly reduces the context shift and improves the generalization ability, surpassing previous methods across various challenging domains.  ( 2 min )
    Sparse Interaction Additive Networks via Feature Interaction Detection and Sparse Selection. (arXiv:2209.09326v2 [cs.LG] UPDATED)
    There is currently a large gap in performance between the statistically rigorous methods like linear regression or additive splines and the powerful deep methods using neural networks. Previous works attempting to close this gap have failed to fully investigate the exponentially growing number of feature combinations which deep networks consider automatically during training. In this work, we develop a tractable selection algorithm to efficiently identify the necessary feature combinations by leveraging techniques in feature interaction detection. Our proposed Sparse Interaction Additive Networks (SIAN) construct a bridge from these simple and interpretable models to fully connected neural networks. SIAN achieves competitive performance against state-of-the-art methods across multiple large-scale tabular datasets and consistently finds an optimal tradeoff between the modeling capacity of neural networks and the generalizability of simpler methods.  ( 2 min )
    Improved weight initialization for deep and narrow feedforward neural network. (arXiv:2311.03733v1 [cs.LG])
    Appropriate weight initialization settings, along with the ReLU activation function, have been a cornerstone of modern deep learning, making it possible to train and deploy highly effective and efficient neural network models across diverse artificial intelligence. The problem of dying ReLU, where ReLU neurons become inactive and yield zero output, presents a significant challenge in the training of deep neural networks with ReLU activation function. Theoretical research and various methods have been introduced to address the problem. However, even with these methods and research, training remains challenging for extremely deep and narrow feedforward networks with ReLU activation function. In this paper, we propose a new weight initialization method to address this issue. We prove the properties of the proposed initial weight matrix and demonstrate how these properties facilitate the effective propagation of signal vectors. Through a series of experiments and comparisons with existing methods, we demonstrate the effectiveness of the new initialization method.  ( 2 min )
    UP-NeRF: Unconstrained Pose-Prior-Free Neural Radiance Fields. (arXiv:2311.03784v1 [cs.CV])
    Neural Radiance Field (NeRF) has enabled novel view synthesis with high fidelity given images and camera poses. Subsequent works even succeeded in eliminating the necessity of pose priors by jointly optimizing NeRF and camera pose. However, these works are limited to relatively simple settings such as photometrically consistent and occluder-free image collections or a sequence of images from a video. So they have difficulty handling unconstrained images with varying illumination and transient occluders. In this paper, we propose \textbf{UP-NeRF} (\textbf{U}nconstrained \textbf{P}ose-prior-free \textbf{Ne}ural \textbf{R}adiance \textbf{F}ields) to optimize NeRF with unconstrained image collections without camera pose prior. We tackle these challenges with surrogate tasks that optimize color-insensitive feature fields and a separate module for transient occluders to block their influence on pose estimation. In addition, we introduce a candidate head to enable more robust pose estimation and transient-aware depth supervision to minimize the effect of incorrect prior. Our experiments verify the superior performance of our method compared to the baselines including BARF and its variants in a challenging internet photo collection, \textit{Phototourism} dataset. The code of UP-NeRF is available at \url{https://github.com/mlvlab/UP-NeRF}.  ( 2 min )
    ProPath: Disease-Specific Protein Language Model for Variant Pathogenicity. (arXiv:2311.03429v1 [q-bio.GN])
    Clinical variant classification of pathogenic versus benign genetic variants remains a pivotal challenge in clinical genetics. Recently, the proposition of protein language models has improved the generic variant effect prediction (VEP) accuracy via weakly-supervised or unsupervised training. However, these VEPs are not disease-specific, limiting their adaptation at point-of-care. To address this problem, we propose a disease-specific \textsc{pro}tein language model for variant \textsc{path}ogenicity, termed ProPath, to capture the pseudo-log-likelihood ratio in rare missense variants through a siamese network. We evaluate the performance of ProPath against pre-trained language models, using clinical variant sets in inherited cardiomyopathies and arrhythmias that were not seen during training. Our results demonstrate that ProPath surpasses the pre-trained ESM1b with an over $5\%$ improvement in AUC across both datasets. Furthermore, our model achieved the highest performances across all baselines for both datasets. Thus, our ProPath offers a potent disease-specific variant effect prediction, particularly valuable for disease associations and clinical applicability.  ( 2 min )
    Generalizability of Adversarial Robustness Under Distribution Shifts. (arXiv:2209.15042v3 [cs.LG] UPDATED)
    Recent progress in empirical and certified robustness promises to deliver reliable and deployable Deep Neural Networks (DNNs). Despite that success, most existing evaluations of DNN robustness have been done on images sampled from the same distribution on which the model was trained. However, in the real world, DNNs may be deployed in dynamic environments that exhibit significant distribution shifts. In this work, we take a first step towards thoroughly investigating the interplay between empirical and certified adversarial robustness on one hand and domain generalization on another. To do so, we train robust models on multiple domains and evaluate their accuracy and robustness on an unseen domain. We observe that: (1) both empirical and certified robustness generalize to unseen domains, and (2) the level of generalizability does not correlate well with input visual similarity, measured by the FID between source and target domains. We also extend our study to cover a real-world medical application, in which adversarial augmentation significantly boosts the generalization of robustness with minimal effect on clean data accuracy.  ( 2 min )
    An Explainable Framework for Machine learning-Based Reactive Power Optimization of Distribution Network. (arXiv:2311.03863v1 [eess.SY])
    To reduce the heavy computational burden of reactive power optimization of distribution networks, machine learning models are receiving increasing attention. However, most machine learning models (e.g., neural networks) are usually considered as black boxes, making it challenging for power system operators to identify and comprehend potential biases or errors in the decision-making process of machine learning models. To address this issue, an explainable machine-learning framework is proposed to optimize the reactive power in distribution networks. Firstly, a Shapley additive explanation framework is presented to measure the contribution of each input feature to the solution of reactive power optimizations generated from machine learning models. Secondly, a model-agnostic approximation method is developed to estimate Shapley values, so as to avoid the heavy computational burden associated with direct calculations of Shapley values. The simulation results show that the proposed explainable framework can accurately explain the solution of the machine learning model-based reactive power optimization by using visual analytics, from both global and instance perspectives. Moreover, the proposed explainable framework is model-agnostic, and thus applicable to various models (e.g., neural networks).  ( 2 min )
    Structure of universal formulas. (arXiv:2311.03910v1 [cs.LG])
    By universal formulas we understand parameterized analytic expressions that have a fixed complexity, but nevertheless can approximate any continuous function on a compact set. There exist various examples of such formulas, including some in the form of neural networks. In this paper we analyze the essential structural elements of these highly expressive models. We introduce a hierarchy of expressiveness classes connecting the global approximability property to the weaker property of infinite VC dimension, and prove a series of classification results for several increasingly complex functional families. In particular, we introduce a general family of polynomially-exponentially-algebraic functions that, as we prove, is subject to polynomial constraints. As a consequence, we show that fixed-size neural networks with not more than one layer of neurons having transcendental activations (e.g., sine or standard sigmoid) cannot in general approximate functions on arbitrary finite sets. On the other hand, we give examples of functional families, including two-hidden-layer neural networks, that approximate functions on arbitrary finite sets, but fail to do that on the whole domain of definition.  ( 2 min )
    SeRO: Self-Supervised Reinforcement Learning for Recovery from Out-of-Distribution Situations. (arXiv:2311.03651v1 [cs.LG])
    Robotic agents trained using reinforcement learning have the problem of taking unreliable actions in an out-of-distribution (OOD) state. Agents can easily become OOD in real-world environments because it is almost impossible for them to visit and learn the entire state space during training. Unfortunately, unreliable actions do not ensure that agents perform their original tasks successfully. Therefore, agents should be able to recognize whether they are in OOD states and learn how to return to the learned state distribution rather than continue to take unreliable actions. In this study, we propose a novel method for retraining agents to recover from OOD situations in a self-supervised manner when they fall into OOD states. Our in-depth experimental results demonstrate that our method substantially improves the agent's ability to recover from OOD situations in terms of sample efficiency and restoration of the performance for the original tasks. Moreover, we show that our method can retrain the agent to recover from OOD situations even when in-distribution states are difficult to visit through exploration.  ( 2 min )
    Posterior Sampling-Based Bayesian Optimization with Tighter Bayesian Regret Bounds. (arXiv:2311.03760v1 [cs.LG])
    Among various acquisition functions (AFs) in Bayesian optimization (BO), Gaussian process upper confidence bound (GP-UCB) and Thompson sampling (TS) are well-known options with established theoretical properties regarding Bayesian cumulative regret (BCR). Recently, it has been shown that a randomized variant of GP-UCB achieves a tighter BCR bound compared with GP-UCB, which we call the tighter BCR bound for brevity. Inspired by this study, this paper first shows that TS achieves the tighter BCR bound. On the other hand, GP-UCB and TS often practically suffer from manual hyperparameter tuning and over-exploration issues, respectively. To overcome these difficulties, we propose yet another AF called a probability of improvement from the maximum of a sample path (PIMS). We show that PIMS achieves the tighter BCR bound and avoids the hyperparameter tuning, unlike GP-UCB. Furthermore, we demonstrate a wide range of experiments, focusing on the effectiveness of PIMS that mitigates the practical issues of GP-UCB and TS.  ( 2 min )
    Bandit Pareto Set Identification: the Fixed Budget Setting. (arXiv:2311.03992v1 [stat.ML])
    We study a multi-objective pure exploration problem in a multi-armed bandit model. Each arm is associated to an unknown multi-variate distribution and the goal is to identify the distributions whose mean is not uniformly worse than that of another distribution: the Pareto optimal set. We propose and analyze the first algorithms for the \emph{fixed budget} Pareto Set Identification task. We propose Empirical Gap Elimination, a family of algorithms combining a careful estimation of the ``hardness to classify'' each arm in or out of the Pareto set with a generic elimination scheme. We prove that two particular instances, EGE-SR and EGE-SH, have a probability of error that decays exponentially fast with the budget, with an exponent supported by an information theoretic lower-bound. We complement these findings with an empirical study using real-world and synthetic datasets, which showcase the good performance of our algorithms.  ( 2 min )
    TWIST: Teacher-Student World Model Distillation for Efficient Sim-to-Real Transfer. (arXiv:2311.03622v1 [cs.RO])
    Model-based RL is a promising approach for real-world robotics due to its improved sample efficiency and generalization capabilities compared to model-free RL. However, effective model-based RL solutions for vision-based real-world applications require bridging the sim-to-real gap for any world model learnt. Due to its significant computational cost, standard domain randomisation does not provide an effective solution to this problem. This paper proposes TWIST (Teacher-Student World Model Distillation for Sim-to-Real Transfer) to achieve efficient sim-to-real transfer of vision-based model-based RL using distillation. Specifically, TWIST leverages state observations as readily accessible, privileged information commonly garnered from a simulator to significantly accelerate sim-to-real transfer. Specifically, a teacher world model is trained efficiently on state information. At the same time, a matching dataset is collected of domain-randomised image observations. The teacher world model then supervises a student world model that takes the domain-randomised image observations as input. By distilling the learned latent dynamics model from the teacher to the student model, TWIST achieves efficient and effective sim-to-real transfer for vision-based model-based RL tasks. Experiments in simulated and real robotics tasks demonstrate that our approach outperforms naive domain randomisation and model-free methods in terms of sample efficiency and task performance of sim-to-real transfer.  ( 2 min )
    Neural Rankers for Code Generation via Inter-Cluster Modeling. (arXiv:2311.03366v1 [cs.SE])
    Code Large Language Models (CodeLLMs) have ushered in a new era of code generation advancements. However, selecting the best solutions from among all possible CodeLLM solutions remains a challenge. Previous methods frequently overlooked the intricate functional similarities and interactions between clusters, resulting in suboptimal results. In this work, we introduce \textit{SRank}, a novel reranking strategy for selecting the best solution from code generation that focuses on modeling inter-cluster relationship. By quantifying the functional overlap between clusters, our approach provides a better ranking strategy of code solutions. Empirical results show that our method achieves a remarkable results on pass@1 score. For instance, on the Human-Eval benchmark, we achieve 69.66\% in pass@1 with Codex002, 75.31\% for WizardCoder, 53.99\% for StarCoder and 60.55\% for CodeGen, which surpass the state-of-the-arts solution ranking methods, such as CodeT and Coder-Reviewer on the same CodeLLM with significant margin ($\approx 6.1\%$ improvement on average). Comparing to the random sampling method, we can achieve an average improvement of $\approx 23.07\%$ on Human-Eval and 17.64\% on MBPP. Even in scenarios with limited test inputs, our approach demonstrates robustness and superiority, marking a new state-of-the-arts in code generation reranking.
    Learning to Learn for Few-shot Continual Active Learning. (arXiv:2311.03732v1 [cs.LG])
    Continual learning strives to ensure stability in solving previously seen tasks while demonstrating plasticity in a novel domain. Recent advances in CL are mostly confined to a supervised learning setting, especially in NLP domain. In this work, we consider a few-shot continual active learning (CAL) setting where labeled data is inadequate, and unlabeled data is abundant but with a limited annotation budget. We propose a simple but efficient method, called Meta-Continual Active Learning. Specifically, we employ meta-learning and experience replay to address the trade-off between stability and plasticity. As a result, it finds an optimal initialization that efficiently utilizes annotated information for fast adaptation while preventing catastrophic forgetting of past tasks. We conduct extensive experiments to validate the effectiveness of the proposed method and analyze the effect of various active learning strategies and memory sample selection methods in a few-shot CAL setup. Our experiment results demonstrate that random sampling is the best default strategy for both active learning and memory sample selection to solve few-shot CAL problems.
    Augmenting Radio Signals with Wavelet Transform for Deep Learning-Based Modulation Recognition. (arXiv:2311.03761v1 [cs.LG])
    The use of deep learning for radio modulation recognition has become prevalent in recent years. This approach automatically extracts high-dimensional features from large datasets, facilitating the accurate classification of modulation schemes. However, in real-world scenarios, it may not be feasible to gather sufficient training data in advance. Data augmentation is a method used to increase the diversity and quantity of training dataset and to reduce data sparsity and imbalance. In this paper, we propose data augmentation methods that involve replacing detail coefficients decomposed by discrete wavelet transform for reconstructing to generate new samples and expand the training set. Different generation methods are used to generate replacement sequences. Simulation results indicate that our proposed methods significantly outperform the other augmentation methods.
    Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. (arXiv:2110.00604v3 [math.OC] UPDATED)
    Two-level stochastic optimization formulations have become instrumental in a number of machine learning contexts such as continual learning, neural architecture search, adversarial learning, and hyperparameter tuning. Practical stochastic bilevel optimization problems become challenging in optimization or learning scenarios where the number of variables is high or there are constraints. In this paper, we introduce a bilevel stochastic gradient method for bilevel problems with nonlinear and possibly nonconvex lower-level constraints. We also present a comprehensive convergence theory that addresses both the lower-level unconstrained and constrained cases and covers all inexact calculations of the adjoint gradient (also called hypergradient), such as the inexact solution of the lower-level problem, inexact computation of the adjoint formula (due to the inexact solution of the adjoint equation or use of a truncated Neumann series), and noisy estimates of the gradients, Hessians, and Jacobians involved. To promote the use of bilevel optimization in large-scale learning, we have developed new low-rank practical bilevel stochastic gradient methods (BSG-N-FD and~BSG-1) that do not require second-order derivatives and, in the lower-level unconstrained case, dismiss any matrix-vector products.  ( 2 min )
    Geodesic Multi-Modal Mixup for Robust Fine-Tuning. (arXiv:2203.03897v4 [cs.CV] UPDATED)
    Pre-trained multi-modal models, such as CLIP, provide transferable embeddings and show promising results in diverse applications. However, the analysis of learned multi-modal embeddings is relatively unexplored, and the embedding transferability can be improved. In this work, we observe that CLIP holds separated embedding subspaces for two different modalities, and then we investigate it through the lens of uniformity-alignment to measure the quality of learned representation. Both theoretically and empirically, we show that CLIP retains poor uniformity and alignment even after fine-tuning. Such a lack of alignment and uniformity might restrict the transferability and robustness of embeddings. To this end, we devise a new fine-tuning method for robust representation equipping better alignment and uniformity. First, we propose a Geodesic Multi-Modal Mixup that mixes the embeddings of image and text to generate hard negative samples on the hypersphere. Then, we fine-tune the model on hard negatives as well as original negatives and positives with contrastive loss. Based on the theoretical analysis about hardness guarantee and limiting behavior, we justify the use of our method. Extensive experiments on retrieval, calibration, few- or zero-shot classification (under distribution shift), embedding arithmetic, and image captioning further show that our method provides transferable representations, enabling robust model adaptation on diverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup  ( 3 min )
    FD-MIA: Efficient Attacks on Fairness-enhanced Models. (arXiv:2311.03865v1 [cs.LG])
    Previous studies have developed fairness methods for biased models that exhibit discriminatory behaviors towards specific subgroups. While these models have shown promise in achieving fair predictions, recent research has identified their potential vulnerability to score-based membership inference attacks (MIAs). In these attacks, adversaries can infer whether a particular data sample was used during training by analyzing the model's prediction scores. However, our investigations reveal that these score-based MIAs are ineffective when targeting fairness-enhanced models in binary classifications. The attack models trained to launch the MIAs degrade into simplistic threshold models, resulting in lower attack performance. Meanwhile, we observe that fairness methods often lead to prediction performance degradation for the majority subgroups of the training data. This raises the barrier to successful attacks and widens the prediction gaps between member and non-member data. Building upon these insights, we propose an efficient MIA method against fairness-enhanced models based on fairness discrepancy results (FD-MIA). It leverages the difference in the predictions from both the original and fairness-enhanced models and exploits the observed prediction gaps as attack clues. We also explore potential strategies for mitigating privacy leakages. Extensive experiments validate our findings and demonstrate the efficacy of the proposed method.  ( 2 min )
    PT-Tuning: Bridging the Gap between Time Series Masked Reconstruction and Forecasting via Prompt Token Tuning. (arXiv:2311.03768v1 [cs.LG])
    Self-supervised learning has been actively studied in time series domain recently, especially for masked reconstruction. Most of these methods follow the "Pre-training + Fine-tuning" paradigm in which a new decoder replaces the pre-trained decoder to fit for a specific downstream task, leading to inconsistency of upstream and downstream tasks. In this paper, we first point out that the unification of task objectives and adaptation for task difficulty are critical for bridging the gap between time series masked reconstruction and forecasting. By reserving the pre-trained mask token during fine-tuning stage, the forecasting task can be taken as a special case of masked reconstruction, where the future values are masked and reconstructed based on history values. It guarantees the consistency of task objectives but there is still a gap in task difficulty. Because masked reconstruction can utilize contextual information while forecasting can only use historical information to reconstruct. To further mitigate the existed gap, we propose a simple yet effective prompt token tuning (PT-Tuning) paradigm, in which all pre-trained parameters are frozen and only a few trainable prompt tokens are added to extended mask tokens in element-wise manner. Extensive experiments on real-world datasets demonstrate the superiority of our proposed paradigm with state-of-the-art performance compared to representation learning and end-to-end supervised forecasting methods.  ( 3 min )
    Aspects of human memory and Large Language Models. (arXiv:2311.03839v1 [cs.CL])
    Large Language Models (LLMs) are huge artificial neural networks which primarily serve to generate text, but also provide a very sophisticated probabilistic model of language use. Since generating a semantically consistent text requires a form of effective memory, we investigate the memory properties of LLMs and find surprising similarities with key characteristics of human memory. This result strongly suggests that the biological features of human memory leave an imprint on the way that we structure our textual narratives.  ( 2 min )
    Cup Curriculum: Curriculum Learning on Model Capacity. (arXiv:2311.03956v1 [cs.LG])
    Curriculum learning (CL) aims to increase the performance of a learner on a given task by applying a specialized learning strategy. This strategy focuses on either the dataset, the task, or the model. There is little to no work analysing the possibilities to apply CL on the model capacity in natural language processing. To close this gap, we propose the cup curriculum. In a first phase of training we use a variation of iterative magnitude pruning to reduce model capacity. These weights are reintroduced in a second phase, resulting in the model capacity to show a cup-shaped curve over the training iterations. We empirically evaluate different strategies of the cup curriculum and show that it outperforms early stopping reliably while exhibiting a high resilience to overfitting.  ( 2 min )
    User-level Differentially Private Stochastic Convex Optimization: Efficient Algorithms with Optimal Rates. (arXiv:2311.03797v1 [cs.LG])
    We study differentially private stochastic convex optimization (DP-SCO) under user-level privacy, where each user may hold multiple data items. Existing work for user-level DP-SCO either requires super-polynomial runtime [Ghazi et al. (2023)] or requires the number of users to grow polynomially with the dimensionality of the problem with additional strict assumptions [Bassily et al. (2023)]. We develop new algorithms for user-level DP-SCO that obtain optimal rates for both convex and strongly convex functions in polynomial time and require the number of users to grow only logarithmically in the dimension. Moreover, our algorithms are the first to obtain optimal rates for non-smooth functions in polynomial time. These algorithms are based on multiple-pass DP-SGD, combined with a novel private mean estimation procedure for concentrated data, which applies an outlier removal step before estimating the mean of the gradients.  ( 2 min )
    Unsupervised Video Summarization. (arXiv:2311.03745v1 [cs.CV])
    This paper introduces a new, unsupervised method for automatic video summarization using ideas from generative adversarial networks but eliminating the discriminator, having a simple loss function, and separating training of different parts of the model. An iterative training strategy is also applied by alternately training the reconstructor and the frame selector for multiple iterations. Furthermore, a trainable mask vector is added to the model in summary generation during training and evaluation. The method also includes an unsupervised model selection algorithm. Results from experiments on two public datasets (SumMe and TVSum) and four datasets we created (Soccer, LoL, MLB, and ShortMLB) demonstrate the effectiveness of each component on the model performance, particularly the iterative training strategy. Evaluations and comparisons with the state-of-the-art methods highlight the advantages of the proposed method in performance, stability, and training efficiency.  ( 2 min )
    Temporal Graph Representation Learning with Adaptive Augmentation Contrastive. (arXiv:2311.03897v1 [cs.LG])
    Temporal graph representation learning aims to generate low-dimensional dynamic node embeddings to capture temporal information as well as structural and property information. Current representation learning methods for temporal networks often focus on capturing fine-grained information, which may lead to the model capturing random noise instead of essential semantic information. While graph contrastive learning has shown promise in dealing with noise, it only applies to static graphs or snapshots and may not be suitable for handling time-dependent noise. To alleviate the above challenge, we propose a novel Temporal Graph representation learning with Adaptive augmentation Contrastive (TGAC) model. The adaptive augmentation on the temporal graph is made by combining prior knowledge with temporal information, and the contrastive objective function is constructed by defining the augmented inter-view contrast and intra-view contrast. To complement TGAC, we propose three adaptive augmentation strategies that modify topological features to reduce noise from the network. Our extensive experiments on various real networks demonstrate that the proposed model outperforms other temporal graph representation learning methods.  ( 2 min )
    The NeurIPS 2022 Neural MMO Challenge: A Massively Multiagent Competition with Specialization and Trade. (arXiv:2311.03707v1 [cs.AI])
    In this paper, we present the results of the NeurIPS-2022 Neural MMO Challenge, which attracted 500 participants and received over 1,600 submissions. Like the previous IJCAI-2022 Neural MMO Challenge, it involved agents from 16 populations surviving in procedurally generated worlds by collecting resources and defeating opponents. This year's competition runs on the latest v1.6 Neural MMO, which introduces new equipment, combat, trading, and a better scoring system. These elements combine to pose additional robustness and generalization challenges not present in previous competitions. This paper summarizes the design and results of the challenge, explores the potential of this environment as a benchmark for learning methods, and presents some practical reinforcement learning training approaches for complex tasks with sparse rewards. Additionally, we have open-sourced our baselines, including environment wrappers, benchmarks, and visualization tools for future research.  ( 2 min )
    Structural Causal Models Reveal Confounder Bias in Linear Program Modelling. (arXiv:2105.12697v6 [cs.LG] UPDATED)
    The recent years have been marked by extended research on adversarial attacks, especially on deep neural networks. With this work we intend on posing and investigating the question of whether the phenomenon might be more general in nature, that is, adversarial-style attacks outside classical classification tasks. Specifically, we investigate optimization problems as they constitute a fundamental part of modern AI research. To this end, we consider the base class of optimizers namely Linear Programs (LPs). On our initial attempt of a na\"ive mapping between the formalism of adversarial examples and LPs, we quickly identify the key ingredients missing for making sense of a reasonable notion of adversarial examples for LPs. Intriguingly, the formalism of Pearl's notion to causality allows for the right description of adversarial like examples for LPs. Characteristically, we show the direct influence of the Structural Causal Model (SCM) onto the subsequent LP optimization, which ultimately exposes a notion of confounding in LPs (inherited by said SCM) that allows for adversarial-style attacks. We provide both the general proof formally alongside existential proofs of such intriguing LP-parameterizations based on SCM for three combinatorial problems, namely Linear Assignment, Shortest Path and a real world problem of energy systems.
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v3 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.  ( 2 min )
    Preventing Arbitrarily High Confidence on Far-Away Data in Point-Estimated Discriminative Neural Networks. (arXiv:2311.03683v1 [cs.LG])
    Discriminatively trained, deterministic neural networks are the de facto choice for classification problems. However, even though they achieve state-of-the-art results on in-domain test sets, they tend to be overconfident on out-of-distribution (OOD) data. For instance, ReLU networks -- a popular class of neural network architectures -- have been shown to almost always yield high confidence predictions when the test data are far away from the training set, even when they are trained with OOD data. We overcome this problem by adding a term to the output of the neural network that corresponds to the logit of an extra class, that we design to dominate the logits of the original classes as we move away from the training data.This technique provably prevents arbitrarily high confidence on far-away test data while maintaining a simple discriminative point-estimate training. Evaluation on various benchmarks demonstrates strong performance against competitive baselines on both far-away and realistic OOD data.  ( 2 min )
    Federated Learning for Clinical Structured Data: A Benchmark Comparison of Engineering and Statistical Approaches. (arXiv:2311.03417v1 [cs.LG])
    Federated learning (FL) has shown promising potential in safeguarding data privacy in healthcare collaborations. While the term "FL" was originally coined by the engineering community, the statistical field has also explored similar privacy-preserving algorithms. Statistical FL algorithms, however, remain considerably less recognized than their engineering counterparts. Our goal was to bridge the gap by presenting the first comprehensive comparison of FL frameworks from both engineering and statistical domains. We evaluated five FL frameworks using both simulated and real-world data. The results indicate that statistical FL algorithms yield less biased point estimates for model coefficients and offer convenient confidence interval estimations. In contrast, engineering-based methods tend to generate more accurate predictions, sometimes surpassing central pooled and statistical FL models. This study underscores the relative strengths and weaknesses of both types of methods, emphasizing the need for increased awareness and their integration in future FL applications.  ( 2 min )
    Brain Networks and Intelligence: A Graph Neural Network Based Approach to Resting State fMRI Data. (arXiv:2311.03520v1 [cs.LG])
    Resting-state functional magnetic resonance imaging (rsfMRI) is a powerful tool for investigating the relationship between brain function and cognitive processes as it allows for the functional organization of the brain to be captured without relying on a specific task or stimuli. In this paper, we present a novel modeling architecture called BrainRGIN for predicting intelligence (fluid, crystallized, and total intelligence) using graph neural networks on rsfMRI derived static functional network connectivity matrices. Extending from the existing graph convolution networks, our approach incorporates a clustering-based embedding and graph isomorphism network in the graph convolutional layer to reflect the nature of the brain sub-network organization and efficient network expression, in combination with TopK pooling and attention-based readout functions. We evaluated our proposed architecture on a large dataset, specifically the Adolescent Brain Cognitive Development Dataset, and demonstrated its effectiveness in predicting individual differences in intelligence. Our model achieved lower mean squared errors and higher correlation scores than existing relevant graph architectures and other traditional machine learning models for all of the intelligence prediction tasks. The middle frontal gyrus exhibited a significant contribution to both fluid and crystallized intelligence, suggesting their pivotal role in these cognitive processes. Total composite scores identified a diverse set of brain regions to be relevant which underscores the complex nature of total intelligence.  ( 3 min )
    Enhanced physics-informed neural networks with domain scaling and residual correction methods for multi-frequency elliptic problems. (arXiv:2311.03746v1 [math.NA])
    In this paper, neural network approximation methods are developed for elliptic partial differential equations with multi-frequency solutions. Neural network work approximation methods have advantages over classical approaches in that they can be applied without much concerns on the form of the differential equations or the shape or dimension of the problem domain. When applied to problems with multi-frequency solutions, the performance and accuracy of neural network approximation methods are strongly affected by the contrast of the high- and low-frequency parts in the solutions. To address this issue, domain scaling and residual correction methods are proposed. The efficiency and accuracy of the proposed methods are demonstrated for multi-frequency model problems.  ( 2 min )
    A Novel Variational Lower Bound for Inverse Reinforcement Learning. (arXiv:2311.03698v1 [cs.LG])
    Inverse reinforcement learning (IRL) seeks to learn the reward function from expert trajectories, to understand the task for imitation or collaboration thereby removing the need for manual reward engineering. However, IRL in the context of large, high-dimensional problems with unknown dynamics has been particularly challenging. In this paper, we present a new Variational Lower Bound for IRL (VLB-IRL), which is derived under the framework of a probabilistic graphical model with an optimality node. Our method simultaneously learns the reward function and policy under the learned reward function by maximizing the lower bound, which is equivalent to minimizing the reverse Kullback-Leibler divergence between an approximated distribution of optimality given the reward function and the true distribution of optimality given trajectories. This leads to a new IRL method that learns a valid reward function such that the policy under the learned reward achieves expert-level performance on several known domains. Importantly, the method outperforms the existing state-of-the-art IRL algorithms on these domains by demonstrating better reward from the learned policy.  ( 2 min )
    Enhancing AI Research Paper Analysis: Methodology Component Extraction using Factored Transformer-based Sequence Modeling Approach. (arXiv:2311.03401v1 [cs.IR])
    Research in scientific disciplines evolves, often rapidly, over time with the emergence of novel methodologies and their associated terminologies. While methodologies themselves being conceptual in nature and rather difficult to automatically extract and characterise, in this paper, we seek to develop supervised models for automatic extraction of the names of the various constituents of a methodology, e.g., `R-CNN', `ELMo' etc. The main research challenge for this task is effectively modeling the contexts around these methodology component names in a few-shot or even a zero-shot setting. The main contributions of this paper towards effectively identifying new evolving scientific methodology names are as follows: i) we propose a factored approach to sequence modeling, which leverages a broad-level category information of methodology domains, e.g., `NLP', `RL' etc.; ii) to demonstrate the feasibility of our proposed approach of identifying methodology component names under a practical setting of fast evolving AI literature, we conduct experiments following a simulated chronological setup (newer methodologies not seen during the training process); iii) our experiments demonstrate that the factored approach outperforms state-of-the-art baselines by margins of up to 9.257\% for the methodology extraction task with the few-shot setup.  ( 2 min )
    Multimodal deep representation learning for quantum cross-platform verification. (arXiv:2311.03713v1 [quant-ph])
    Cross-platform verification, a critical undertaking in the realm of early-stage quantum computing, endeavors to characterize the similarity of two imperfect quantum devices executing identical algorithms, utilizing minimal measurements. While the random measurement approach has been instrumental in this context, the quasi-exponential computational demand with increasing qubit count hurdles its feasibility in large-qubit scenarios. To bridge this knowledge gap, here we introduce an innovative multimodal learning approach, recognizing that the formalism of data in this task embodies two distinct modalities: measurement outcomes and classical description of compiled circuits on explored quantum devices, both enriched with unique information. Building upon this insight, we devise a multimodal neural network to independently extract knowledge from these modalities, followed by a fusion operation to create a comprehensive data representation. The learned representation can effectively characterize the similarity between the explored quantum devices when executing new quantum algorithms not present in the training data. We evaluate our proposal on platforms featuring diverse noise models, encompassing system sizes up to 50 qubits. The achieved results demonstrate a three-orders-of-magnitude improvement in prediction accuracy compared to the random measurements and offer compelling evidence of the complementary roles played by each modality in cross-platform verification. These findings pave the way for harnessing the power of multimodal learning to overcome challenges in wider quantum system learning tasks.  ( 2 min )
    GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values. (arXiv:2311.03426v1 [cs.LG])
    Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT, which achieved an approximate 0.3% increase in accuracy while reducing the model size by about 4% in the task of image classification. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy.  ( 2 min )
    Leveraging High-Level Synthesis and Large Language Models to Generate, Simulate, and Deploy a Uniform Random Number Generator Hardware Design. (arXiv:2311.03489v1 [cs.AR])
    We present a new high-level synthesis methodology for using large language model tools to generate hardware designs. The methodology uses exclusively open-source tools excluding the large language model. As a case study, we use our methodology to generate a permuted congruential random number generator design with a wishbone interface. We verify the functionality and quality of the random number generator design using large language model-generated simulations and the Dieharder randomness test suite. We document all the large language model chat logs, Python scripts, Verilog scripts, and simulation results used in the case study. We believe that our method of hardware design generation coupled with the open source silicon 130 nm design tools will revolutionize application-specific integrated circuit design. Our methodology significantly lowers the bar to entry when building domain-specific computing accelerators for the Internet of Things and proof of concept prototypes for later fabrication in more modern process nodes.  ( 2 min )
    Testing RadiX-Nets: Advances in Viable Sparse Topologies. (arXiv:2311.03609v1 [cs.LG])
    The exponential growth of data has sparked computational demands on ML research and industry use. Sparsification of hyper-parametrized deep neural networks (DNNs) creates simpler representations of complex data. Past research has shown that some sparse networks achieve similar performance as dense ones, reducing runtime and storage. RadiX-Nets, a subgroup of sparse DNNs, maintain uniformity which counteracts their lack of neural connections. Generation, independent of a dense network, yields faster asymptotic training and removes the need for costly pruning. However, little work has been done on RadiX-Nets, making testing challenging. This paper presents a testing suite for RadiX-Nets in TensorFlow. We test RadiX-Net performance to streamline processing in scalable models, revealing relationships between network topology, initialization, and training behavior. We also encounter "strange models" that train inconsistently and to lower accuracy while models of similar sparsity train well.  ( 2 min )
    Communication Efficient and Privacy-Preserving Federated Learning Based on Evolution Strategies. (arXiv:2311.03405v1 [cs.LG])
    Federated learning (FL) is an emerging paradigm for training deep neural networks (DNNs) in distributed manners. Current FL approaches all suffer from high communication overhead and information leakage. In this work, we present a federated learning algorithm based on evolution strategies (FedES), a zeroth-order training method. Instead of transmitting model parameters, FedES only communicates loss values, and thus has very low communication overhead. Moreover, a third party is unable to estimate gradients without knowing the pre-shared seed, which protects data privacy. Experimental results demonstrate FedES can achieve the above benefits while keeping convergence performance the same as that with back propagation methods.  ( 2 min )
    A Generative Neural Network Approach for 3D Multi-Criteria Design Generation and Optimization of an Engine Mount for an Unmanned Air Vehicle. (arXiv:2311.03414v1 [cs.LG])
    One of the most promising developments in computer vision in recent years is the use of generative neural networks for functionality condition-based 3D design reconstruction and generation. Here, neural networks learn dependencies between functionalities and a geometry in a very effective way. For a neural network the functionalities are translated in conditions to a certain geometry. But the more conditions the design generation needs to reflect, the more difficult it is to learn clear dependencies. This leads to a multi criteria design problem due various conditions, which are not considered in the neural network structure so far. In this paper, we address this multi-criteria challenge for a 3D design use case related to an unmanned aerial vehicle (UAV) motor mount. We generate 10,000 abstract 3D designs and subject them all to simulations for three physical disciplines: mechanics, thermodynamics, and aerodynamics. Then, we train a Conditional Variational Autoencoder (CVAE) using the geometry and corresponding multicriteria functional constraints as input. We use our trained CVAE as well as the Marching cubes algorithm to generate meshes for simulation based evaluation. The results are then evaluated with the generated UAV designs. Subsequently, we demonstrate the ability to generate optimized designs under self-defined functionality conditions using the trained neural network.  ( 3 min )
    Blocked Collaborative Bandits: Online Collaborative Filtering with Per-Item Budget Constraints. (arXiv:2311.03376v1 [cs.IR])
    We consider the problem of \emph{blocked} collaborative bandits where there are multiple users, each with an associated multi-armed bandit problem. These users are grouped into \emph{latent} clusters such that the mean reward vectors of users within the same cluster are identical. Our goal is to design algorithms that maximize the cumulative reward accrued by all the users over time, under the \emph{constraint} that no arm of a user is pulled more than $\mathsf{B}$ times. This problem has been originally considered by \cite{Bresler:2014}, and designing regret-optimal algorithms for it has since remained an open problem. In this work, we propose an algorithm called \texttt{B-LATTICE} (Blocked Latent bAndiTs via maTrIx ComplEtion) that collaborates across users, while simultaneously satisfying the budget constraints, to maximize their cumulative rewards. Theoretically, under certain reasonable assumptions on the latent structure, with $\mathsf{M}$ users, $\mathsf{N}$ arms, $\mathsf{T}$ rounds per user, and $\mathsf{C}=O(1)$ latent clusters, \texttt{B-LATTICE} achieves a per-user regret of $\widetilde{O}(\sqrt{\mathsf{T}(1 + \mathsf{N}\mathsf{M}^{-1})}$ under a budget constraint of $\mathsf{B}=\Theta(\log \mathsf{T})$. These are the first sub-linear regret bounds for this problem, and match the minimax regret bounds when $\mathsf{B}=\mathsf{T}$. Empirically, we demonstrate that our algorithm has superior performance over baselines even when $\mathsf{B}=1$. \texttt{B-LATTICE} runs in phases where in each phase it clusters users into groups and collaborates across users within a group to quickly learn their reward models.  ( 2 min )
    Multi-Resolution Diffusion for Privacy-Sensitive Recommender Systems. (arXiv:2311.03488v1 [cs.IR])
    While recommender systems have become an integral component of the Web experience, their heavy reliance on user data raises privacy and security concerns. Substituting user data with synthetic data can address these concerns, but accurately replicating these real-world datasets has been a notoriously challenging problem. Recent advancements in generative AI have demonstrated the impressive capabilities of diffusion models in generating realistic data across various domains. In this work we introduce a Score-based Diffusion Recommendation Model (SDRM), which captures the intricate patterns of real-world datasets required for training highly accurate recommender systems. SDRM allows for the generation of synthetic data that can replace existing datasets to preserve user privacy, or augment existing datasets to address excessive data sparsity. Our method outperforms competing baselines such as generative adversarial networks, variational autoencoders, and recently proposed diffusion models in synthesizing various datasets to replace or augment the original data by an average improvement of 4.30% in Recall@$n$ and 4.65% in NDCG@$n$.  ( 2 min )
    Attention-based Models for Snow-Water Equivalent Prediction. (arXiv:2311.03388v1 [cs.LG])
    Snow Water-Equivalent (SWE) -- the amount of water available if snowpack is melted -- is a key decision variable used by water management agencies to make irrigation, flood control, power generation and drought management decisions. SWE values vary spatiotemporally -- affected by weather, topography and other environmental factors. While daily SWE can be measured by Snow Telemetry (SNOTEL) stations with requisite instrumentation, such stations are spatially sparse requiring interpolation techniques to create spatiotemporally complete data. While recent efforts have explored machine learning (ML) for SWE prediction, a number of recent ML advances have yet to be considered. The main contribution of this paper is to explore one such ML advance, attention mechanisms, for SWE prediction. Our hypothesis is that attention has a unique ability to capture and exploit correlations that may exist across locations or the temporal spectrum (or both). We present a generic attention-based modeling framework for SWE prediction and adapt it to capture spatial attention and temporal attention. Our experimental results on 323 SNOTEL stations in the Western U.S. demonstrate that our attention-based models outperform other machine learning approaches. We also provide key results highlighting the differences between spatial and temporal attention in this context and a roadmap toward deployment for generating spatially-complete SWE maps.  ( 2 min )
    InterVLS: Interactive Model Understanding and Improvement with Vision-Language Surrogates. (arXiv:2311.03547v1 [cs.AI])
    Deep learning models are widely used in critical applications, highlighting the need for pre-deployment model understanding and improvement. Visual concept-based methods, while increasingly used for this purpose, face challenges: (1) most concepts lack interpretability, (2) existing methods require model knowledge, often unavailable at run time. Additionally, (3) there lacks a no-code method for post-understanding model improvement. Addressing these, we present InterVLS. The system facilitates model understanding by discovering text-aligned concepts, measuring their influence with model-agnostic linear surrogates. Employing visual analytics, InterVLS offers concept-based explanations and performance insights. It enables users to adjust concept influences to update a model, facilitating no-code model improvement. We evaluate InterVLS in a user study, illustrating its functionality with two scenarios. Results indicates that InterVLS is effective to help users identify influential concepts to a model, gain insights and adjust concept influence to improve the model. We conclude with a discussion based on our study results.  ( 2 min )
    Exploring Latent Spaces of Tonal Music using Variational Autoencoders. (arXiv:2311.03621v1 [cs.SD])
    Variational Autoencoders (VAEs) have proven to be effective models for producing latent representations of cognitive and semantic value. We assess the degree to which VAEs trained on a prototypical tonal music corpus of 371 Bach's chorales define latent spaces representative of the circle of fifths and the hierarchical relation of each key component pitch as drawn in music cognition. In detail, we compare the latent space of different VAE corpus encodings -- Piano roll, MIDI, ABC, Tonnetz, DFT of pitch, and pitch class distributions -- in providing a pitch space for key relations that align with cognitive distances. We evaluate the model performance of these encodings using objective metrics to capture accuracy, mean square error (MSE), KL-divergence, and computational cost. The ABC encoding performs the best in reconstructing the original data, while the Pitch DFT seems to capture more information from the latent space. Furthermore, an objective evaluation of 12 major or minor transpositions per piece is adopted to quantify the alignment of 1) intra- and inter-segment distances per key and 2) the key distances to cognitive pitch spaces. Our results show that Pitch DFT VAE latent spaces align best with cognitive spaces and provide a common-tone space where overlapping objects within a key are fuzzy clusters, which impose a well-defined order of structural significance or stability -- i.e., a tonal hierarchy. Tonal hierarchies of different keys can be used to measure key distances and the relationships of their in-key components at multiple hierarchies (e.g., notes and chords). The implementation of our VAE and the encodings framework are made available online.  ( 3 min )
    FinA: Fairness of Adverse Effects in Decision-Making of Human-Cyber-Physical-System. (arXiv:2311.03468v1 [cs.AI])
    Ensuring fairness in decision-making systems within Human-Cyber-Physical-Systems (HCPS) is a pressing concern, particularly when diverse individuals, each with varying behaviors and expectations, coexist within the same application space, influenced by a shared set of control actions in the system. The long-term adverse effects of these actions further pose the challenge, as historical experiences and interactions shape individual perceptions of fairness. This paper addresses the challenge of fairness from an equity perspective of adverse effects, taking into account the dynamic nature of human behavior and evolving preferences while recognizing the lasting impact of adverse effects. We formally introduce the concept of Fairness-in-Adverse-Effects (FinA) within the HCPS context. We put forth a comprehensive set of five formulations for FinA, encompassing both the instantaneous and long-term aspects of adverse effects. To empirically validate the effectiveness of our FinA approach, we conducted an evaluation within the domain of smart homes, a pertinent HCPS application. The outcomes of our evaluation demonstrate that the adoption of FinA significantly enhances the overall perception of fairness among individuals, yielding an average improvement of 66.7% when compared to the state-of-the-art method.  ( 2 min )
    Discret2Di -- Deep Learning based Discretization for Model-based Diagnosis. (arXiv:2311.03413v1 [cs.LG])
    Consistency-based diagnosis is an established approach to diagnose technical applications, but suffers from significant modeling efforts, especially for dynamic multi-modal time series. Machine learning seems to be an obvious solution, which becomes less obvious when looking at details: Which notion of consistency can be used? If logical calculi are still to be used, how can dynamic time series be transferred into the discrete world? This paper presents the methodology Discret2Di for automated learning of logical expressions for consistency-based diagnosis. While these logical calculi have advantages by providing a clear notion of consistency, they have the key problem of relying on a discretization of the dynamic system. The solution presented combines machine learning from both the time series and the symbolic domain to automate the learning of logical rules for consistency-based diagnosis.  ( 2 min )
    Low-Rank MDPs with Continuous Action Spaces. (arXiv:2311.03564v1 [cs.LG])
    Low-Rank Markov Decision Processes (MDPs) have recently emerged as a promising framework within the domain of reinforcement learning (RL), as they allow for provably approximately correct (PAC) learning guarantees while also incorporating ML algorithms for representation learning. However, current methods for low-rank MDPs are limited in that they only consider finite action spaces, and give vacuous bounds as $|\mathcal{A}| \to \infty$, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), which is a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modifications to the algorithm, we obtain similar PAC bound when actions are allowed to be continuous. Specifically, when the model for transition functions satisfies a Holder smoothness condition w.r.t. actions, and either the policy class has a uniformly bounded minimum density or the reward function is also Holder smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.  ( 2 min )
    DP-DCAN: Differentially Private Deep Contrastive Autoencoder Network for Single-cell Clustering. (arXiv:2311.03410v1 [cs.LG])
    Single-cell RNA sequencing (scRNA-seq) is important to transcriptomic analysis of gene expression. Recently, deep learning has facilitated the analysis of high-dimensional single-cell data. Unfortunately, deep learning models may leak sensitive information about users. As a result, Differential Privacy (DP) is increasingly used to protect privacy. However, existing DP methods usually perturb whole neural networks to achieve differential privacy, and hence result in great performance overheads. To address this challenge, in this paper, we take advantage of the uniqueness of the autoencoder that it outputs only the dimension-reduced vector in the middle of the network, and design a Differentially Private Deep Contrastive Autoencoder Network (DP-DCAN) by partial network perturbation for single-cell clustering. Since only partial network is added with noise, the performance improvement is obvious and twofold: one part of network is trained with less noise due to a bigger privacy budget, and the other part is trained without any noise. Experimental results of six datasets have verified that DP-DCAN is superior to the traditional DP scheme with whole network perturbation. Moreover, DP-DCAN demonstrates strong robustness to adversarial attacks. The code is available at https://github.com/LFD-byte/DP-DCAN.  ( 2 min )
    CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers. (arXiv:2311.03615v1 [cs.LG])
    Training large-scale artificial intelligence (AI) models demands significant computational power and energy, leading to increased carbon footprint with potential environmental repercussions. This paper delves into the challenges of training AI models across geographically distributed (geo-distributed) data centers, emphasizing the balance between learning performance and carbon footprint. We consider Federated Learning (FL) as a solution, which prioritizes model parameter exchange over raw data, ensuring data privacy and compliance with local regulations. Given the variability in carbon intensity across regions, we propose a new framework called CAFE (short for Carbon-Aware Federated Learning) to optimize training within a fixed carbon footprint budget. Our approach incorporates coreset selection to assess learning performance, employs the Lyapunov drift-plus-penalty framework to address the unpredictability of future carbon intensity, and devises an efficient algorithm to address the combinatorial complexity of the data center selection. Through extensive simulations using real-world carbon intensity data, we demonstrate the efficacy of our algorithm, highlighting its superiority over existing methods in optimizing learning performance while minimizing environmental impact.  ( 2 min )
    An AI-Guided Data Centric Strategy to Detect and Mitigate Biases in Healthcare Datasets. (arXiv:2311.03425v1 [cs.LG])
    The adoption of diagnosis and prognostic algorithms in healthcare has led to concerns about the perpetuation of bias against disadvantaged groups of individuals. Deep learning methods to detect and mitigate bias have revolved around modifying models, optimization strategies, and threshold calibration with varying levels of success. Here, we generate a data-centric, model-agnostic, task-agnostic approach to evaluate dataset bias by investigating the relationship between how easily different groups are learned at small sample sizes (AEquity). We then apply a systematic analysis of AEq values across subpopulations to identify and mitigate manifestations of racial bias in two known cases in healthcare - Chest X-rays diagnosis with deep convolutional neural networks and healthcare utilization prediction with multivariate logistic regression. AEq is a novel and broadly applicable metric that can be applied to advance equity by diagnosing and remediating bias in healthcare datasets.  ( 2 min )
    ViDa: Visualizing DNA hybridization trajectories with biophysics-informed deep graph embeddings. (arXiv:2311.03411v1 [q-bio.QM])
    Visualization tools can help synthetic biologists and molecular programmers understand the complex reactive pathways of nucleic acid reactions, which can be designed for many potential applications and can be modelled using a continuous-time Markov chain (CTMC). Here we present ViDa, a new visualization approach for DNA reaction trajectories that uses a 2D embedding of the secondary structure state space underlying the CTMC model. To this end, we integrate a scattering transform of the secondary structure adjacency, a variational autoencoder, and a nonlinear dimensionality reduction method. We augment the training loss with domain-specific supervised terms that capture both thermodynamic and kinetic features. We assess ViDa on two well-studied DNA hybridization reactions. Our results demonstrate that the domain-specific features lead to significant quality improvements over the state-of-the-art in DNA state space visualization, successfully separating different folding pathways and thus providing useful insights into dominant reaction mechanisms.  ( 2 min )
    An attempt to generate new bridge types from latent space of variational autoencoder. (arXiv:2311.03380v1 [cs.LG])
    Try to generate new bridge types using generative artificial intelligence technology. The grayscale images of the bridge facade with the change of component width was rendered by 3dsMax animation software, and then the OpenCV module performed an appropriate amount of geometric transformation (rotation, horizontal scale, vertical scale) to obtain the image dataset of three-span beam bridge, arch bridge, cable-stayed bridge and suspension bridge. Based on Python programming language, TensorFlow and Keras deep learning platform framework, variational autoencoder was constructed and trained, and low-dimensional bridge-type latent space that is convenient for vector operations was obtained. Variational autoencoder can combine two bridge types on the basis of the original of human into one that is a new bridge type. Generative artificial intelligence technology can assist bridge designers in bridge-type innovation, and can be used as copilot.  ( 2 min )
  • Open

    Bandit Pareto Set Identification: the Fixed Budget Setting. (arXiv:2311.03992v1 [stat.ML])
    We study a multi-objective pure exploration problem in a multi-armed bandit model. Each arm is associated to an unknown multi-variate distribution and the goal is to identify the distributions whose mean is not uniformly worse than that of another distribution: the Pareto optimal set. We propose and analyze the first algorithms for the \emph{fixed budget} Pareto Set Identification task. We propose Empirical Gap Elimination, a family of algorithms combining a careful estimation of the ``hardness to classify'' each arm in or out of the Pareto set with a generic elimination scheme. We prove that two particular instances, EGE-SR and EGE-SH, have a probability of error that decays exponentially fast with the budget, with an exponent supported by an information theoretic lower-bound. We complement these findings with an empirical study using real-world and synthetic datasets, which showcase the good performance of our algorithms.  ( 2 min )
    Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization. (arXiv:2311.04163v1 [cs.LG])
    We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics. In particular, it implies a conceptually new cause for progressive sharpening and the edge of stability; we also highlight connections to other concepts in optimization and generalization including grokking, simplicity bias, and Sharpness-Aware Minimization. Experimentally, we demonstrate the significant influence of paired groups of outliers in the training data with strong opposing signals: consistent, large magnitude features which dominate the network output throughout training and provide gradients which point in opposite directions. Due to these outliers, early optimization enters a narrow valley which carefully balances the opposing groups; subsequent sharpening causes their loss to rise rapidly, oscillating between high on one group and then the other, until the overall loss spikes. We describe how to identify these groups, explore what sets them apart, and carefully study their effect on the network's optimization and behavior. We complement these experiments with a mechanistic explanation on a toy example of opposing signals and a theoretical analysis of a two-layer linear network on a simple model. Our finding enables new qualitative predictions of training behavior which we confirm experimentally. It also provides a new lens through which to study and improve modern training practices for stochastic optimization, which we highlight via a case study of Adam versus SGD.  ( 3 min )
    AdaSub: Stochastic Optimization Using Second-Order Information in Low-Dimensional Subspaces. (arXiv:2310.20060v2 [math.OC] UPDATED)
    We introduce AdaSub, a stochastic optimization algorithm that computes a search direction based on second-order information in a low-dimensional subspace that is defined adaptively based on available current and past information. Compared to first-order methods, second-order methods exhibit better convergence characteristics, but the need to compute the Hessian matrix at each iteration results in excessive computational expenses, making them impractical. To address this issue, our approach enables the management of computational expenses and algorithm efficiency by enabling the selection of the subspace dimension for the search. Our code is freely available on GitHub, and our preliminary numerical results demonstrate that AdaSub surpasses popular stochastic optimizers in terms of time and number of iterations required to reach a given accuracy.  ( 2 min )
    Flow-based distributionally robust optimization. (arXiv:2310.19253v2 [cs.LG] UPDATED)
    We present a computationally efficient framework, called FlowDRO, for solving flow-based distributionally robust optimization (DRO) problems with Wasserstein uncertainty sets while aiming to find continuous worst-case distribution (also called the Least Favorable Distribution, LFD). The requirement for LFD to be continuous is so that the algorithm can be scalable to problems with larger sample sizes and achieve better generalization capability for the induced robust algorithms. To tackle the computationally challenging infinitely dimensional optimization problem, we leverage flow-based models and continuous-time invertible transport maps between the data distribution and the target distribution. We also develop a Wasserstein proximal gradient flow type of algorithm. In theory, we establish the equivalence of the solution by optimal transport map to the original formulation, as well as the dual form of the problem through Wasserstein calculus and Brenier theorem. In practice, we parameterize the transport maps by a sequence of neural networks progressively trained in blocks by gradient descent. Our computational framework is general, can handle high-dimensional data with large sample sizes, and can be useful for various applications. We demonstrate its usage in adversarial learning, distributionally robust hypothesis testing, and a new mechanism for data-driven distribution perturbation differential privacy, where the proposed method gives strong empirical performance on real high-dimensional data.  ( 2 min )
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v3 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.  ( 2 min )
    A Corrected Expected Improvement Acquisition Function Under Noisy Observations. (arXiv:2310.05166v2 [cs.LG] UPDATED)
    Sequential maximization of expected improvement (EI) is one of the most widely used policies in Bayesian optimization because of its simplicity and ability to handle noisy observations. In particular, the improvement function often uses the best posterior mean as the best incumbent in noisy settings. However, the uncertainty associated with the incumbent solution is often neglected in many analytic EI-type methods: a closed-form acquisition function is derived in the noise-free setting, but then applied to the setting with noisy observations. To address this limitation, we propose a modification of EI that corrects its closed-form expression by incorporating the covariance information provided by the Gaussian Process (GP) model. This acquisition function specializes to the classical noise-free result, and we argue should replace that formula in Bayesian optimization software packages, tutorials, and textbooks. This enhanced acquisition provides good generality for noisy and noiseless settings. We show that our method achieves a sublinear convergence rate on the cumulative regret bound under heteroscedastic observation noise. Our empirical results demonstrate that our proposed acquisition function can outperform EI in the presence of noisy observations on benchmark functions for black-box optimization, as well as on parameter search for neural network model compression.  ( 2 min )
    Learning-Based Optimal Control with Performance Guarantees for Unknown Systems with Latent States. (arXiv:2303.17963v2 [eess.SY] UPDATED)
    As control engineering methods are applied to increasingly complex systems, data-driven approaches for system identification appear as a promising alternative to physics-based modeling. While the Bayesian approaches prevalent for safety-critical applications usually rely on the availability of state measurements, the states of a complex system are often not directly measurable. It may then be necessary to jointly estimate the dynamics and the latent state, making the quantification of uncertainties and the design of controllers with formal performance guarantees considerably more challenging. This paper proposes a novel method for the computation of an optimal input trajectory for unknown nonlinear systems with latent states based on a combination of particle Markov chain Monte Carlo methods and scenario theory. Probabilistic performance guarantees are derived for the resulting input trajectory, and an approach to validate the performance of arbitrary control laws is presented. The effectiveness of the proposed method is demonstrated in a numerical simulation.  ( 2 min )
    Posterior Sampling-Based Bayesian Optimization with Tighter Bayesian Regret Bounds. (arXiv:2311.03760v1 [cs.LG])
    Among various acquisition functions (AFs) in Bayesian optimization (BO), Gaussian process upper confidence bound (GP-UCB) and Thompson sampling (TS) are well-known options with established theoretical properties regarding Bayesian cumulative regret (BCR). Recently, it has been shown that a randomized variant of GP-UCB achieves a tighter BCR bound compared with GP-UCB, which we call the tighter BCR bound for brevity. Inspired by this study, this paper first shows that TS achieves the tighter BCR bound. On the other hand, GP-UCB and TS often practically suffer from manual hyperparameter tuning and over-exploration issues, respectively. To overcome these difficulties, we propose yet another AF called a probability of improvement from the maximum of a sample path (PIMS). We show that PIMS achieves the tighter BCR bound and avoids the hyperparameter tuning, unlike GP-UCB. Furthermore, we demonstrate a wide range of experiments, focusing on the effectiveness of PIMS that mitigates the practical issues of GP-UCB and TS.  ( 2 min )
    Inference via robust optimal transportation: theory and methods. (arXiv:2301.06297v2 [math.ST] UPDATED)
    Optimal transport (OT) theory and the related $p$-Wasserstein distance ($W_p$, $p\geq 1$) are widely-applied in statistics and machine learning. In spite of their popularity, inference based on these tools is sensitive to outliers or it can perform poorly when the underlying model has heavy-tails. To cope with these issues, we introduce a new class of procedures. (i) We consider a robust version of the primal OT problem (ROBOT) and show that it defines the {robust Wasserstein distance}, $W^{(\lambda)}$, which depends on a tuning parameter $\lambda > 0$. (ii) We illustrate the link between $W_1$ and $W^{(\lambda)}$ and study its key measure theoretic aspects. (iii) We derive some concentration inequalities for $W^{(\lambda)}$. (iii) We use $W^{(\lambda)}$ to define minimum distance estimators, we provide their statistical guarantees and we illustrate how to apply concentration inequalities for the selection of $\lambda$. (v) We derive the {dual} form of the ROBOT and illustrate its applicability to machine learning problems (generative adversarial networks and domain adaptation). Numerical exercises provide evidence of the benefits yielded by our methods.  ( 2 min )
    Learning Proposals for Practical Energy-Based Regression. (arXiv:2110.11948v2 [cs.LG] UPDATED)
    Energy-based models (EBMs) have experienced a resurgence within machine learning in recent years, including as a promising alternative for probabilistic regression. However, energy-based regression requires a proposal distribution to be manually designed for training, and an initial estimate has to be provided at test-time. We address both of these issues by introducing a conceptually simple method to automatically learn an effective proposal distribution, which is parameterized by a separate network head. To this end, we derive a surprising result, leading to a unified training objective that jointly minimizes the KL divergence from the proposal to the EBM, and the negative log-likelihood of the EBM. At test-time, we can then employ importance sampling with the trained proposal to efficiently evaluate the learned EBM and produce stand-alone predictions. Furthermore, we utilize our derived training objective to learn mixture density networks (MDNs) with a jointly trained energy-based teacher, consistently outperforming conventional MDN training on four real-world regression tasks within computer vision. Code is available at https://github.com/fregu856/ebms_proposals.  ( 2 min )
    Maximum Mean Discrepancy Meets Neural Networks: The Radon-Kolmogorov-Smirnov Test. (arXiv:2309.02422v3 [stat.ML] UPDATED)
    Maximum mean discrepancy (MMD) refers to a general class of nonparametric two-sample tests that are based on maximizing the mean difference over samples from one distribution $P$ versus another $Q$, over all choices of data transformations $f$ living in some function space $\mathcal{F}$. Inspired by recent work that connects what are known as functions of $\textit{Radon bounded variation}$ (RBV) and neural networks (Parhi and Nowak, 2021, 2023), we study the MMD defined by taking $\mathcal{F}$ to be the unit ball in the RBV space of a given smoothness order $k \geq 0$. This test, which we refer to as the $\textit{Radon-Kolmogorov-Smirnov}$ (RKS) test, can be viewed as a generalization of the well-known and classical Kolmogorov-Smirnov (KS) test to multiple dimensions and higher orders of smoothness. It is also intimately connected to neural networks: we prove that the witness in the RKS test -- the function $f$ achieving the maximum mean difference -- is always a ridge spline of degree $k$, i.e., a single neuron in a neural network. This allows us to leverage the power of modern deep learning toolkits to (approximately) optimize the criterion that underlies the RKS test. We prove that the RKS test has asymptotically full power at distinguishing any distinct pair $P \not= Q$ of distributions, derive its asymptotic null distribution, and carry out extensive experiments to elucidate the strengths and weakenesses of the RKS test versus the more traditional kernel MMD test.  ( 3 min )
    Dynamics of Finite Width Kernel and Prediction Fluctuations in Mean Field Neural Networks. (arXiv:2304.03408v3 [stat.ML] UPDATED)
    We analyze the dynamics of finite width effects in wide but finite feature learning neural networks. Starting from a dynamical mean field theory description of infinite width deep neural network kernel and prediction dynamics, we provide a characterization of the $O(1/\sqrt{\text{width}})$ fluctuations of the DMFT order parameters over random initializations of the network weights. Our results, while perturbative in width, unlike prior analyses, are non-perturbative in the strength of feature learning. In the lazy limit of network training, all kernels are random but static in time and the prediction variance has a universal form. However, in the rich, feature learning regime, the fluctuations of the kernels and predictions are dynamically coupled with a variance that can be computed self-consistently. In two layer networks, we show how feature learning can dynamically reduce the variance of the final tangent kernel and final network predictions. We also show how initialization variance can slow down online learning in wide but finite networks. In deeper networks, kernel variance can dramatically accumulate through subsequent layers at large feature learning strengths, but feature learning continues to improve the signal-to-noise ratio of the feature kernels. In discrete time, we demonstrate that large learning rate phenomena such as edge of stability effects can be well captured by infinite width dynamics and that initialization variance can decrease dynamically. For CNNs trained on CIFAR-10, we empirically find significant corrections to both the bias and variance of network dynamics due to finite width.  ( 3 min )
    Convergence Analysis of Mean Shift. (arXiv:2305.08463v3 [stat.ML] UPDATED)
    The mean shift (MS) algorithm seeks a mode of the kernel density estimate (KDE). This study presents a convergence guarantee of the mode estimate sequence generated by the MS algorithm and an evaluation of the convergence rate, under fairly mild conditions, with the help of the argument concerning the {\L}ojasiewicz inequality. Our findings extend existing ones covering analytic kernels and the Epanechnikov kernel. Those are significant in that they cover the biweight kernel, which is optimal among non-negative kernels in terms of the asymptotic statistical efficiency for the KDE-based mode estimation.  ( 2 min )
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case. (arXiv:2208.14960v3 [stat.ME] UPDATED)
    Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.  ( 3 min )
    CeCNN: Copula-enhanced convolutional neural networks in joint prediction of refraction error and axial length based on ultra-widefield fundus images. (arXiv:2311.03967v1 [cs.CV])
    Ultra-widefield (UWF) fundus images are replacing traditional fundus images in screening, detection, prediction, and treatment of complications related to myopia because their much broader visual range is advantageous for highly myopic eyes. Spherical equivalent (SE) is extensively used as the main myopia outcome measure, and axial length (AL) has drawn increasing interest as an important ocular component for assessing myopia. Cutting-edge studies show that SE and AL are strongly correlated. Using the joint information from SE and AL is potentially better than using either separately. In the deep learning community, though there is research on multiple-response tasks with a 3D image biomarker, dependence among responses is only sporadically taken into consideration. Inspired by the spirit that information extracted from the data by statistical methods can improve the prediction accuracy of deep learning models, we formulate a class of multivariate response regression models with a higher-order tensor biomarker, for the bivariate tasks of regression-classification and regression-regression. Specifically, we propose a copula-enhanced convolutional neural network (CeCNN) framework that incorporates the dependence between responses through a Gaussian copula (with parameters estimated from a warm-up CNN) and uses the induced copula-likelihood loss with the backbone CNNs. We establish the statistical framework and algorithms for the aforementioned two bivariate tasks. We show that the CeCNN has better prediction accuracy after adding the dependency information to the backbone models. The modeling and the proposed CeCNN algorithm are applicable beyond the UWF scenario and can be effective with other backbones beyond ResNet and LeNet.  ( 3 min )
    Birth of a Transformer: A Memory Viewpoint. (arXiv:2306.00802v2 [stat.ML] UPDATED)
    Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.  ( 2 min )
    Accurate 3D Object Detection using Energy-Based Models. (arXiv:2012.04634v2 [cs.CV] UPDATED)
    Accurate 3D object detection (3DOD) is crucial for safe navigation of complex environments by autonomous robots. Regressing accurate 3D bounding boxes in cluttered environments based on sparse LiDAR data is however a highly challenging problem. We address this task by exploring recent advances in conditional energy-based models (EBMs) for probabilistic regression. While methods employing EBMs for regression have demonstrated impressive performance on 2D object detection in images, these techniques are not directly applicable to 3D bounding boxes. In this work, we therefore design a differentiable pooling operator for 3D bounding boxes, serving as the core module of our EBM network. We further integrate this general approach into the state-of-the-art 3D object detector SA-SSD. On the KITTI dataset, our proposed approach consistently outperforms the SA-SSD baseline across all 3DOD metrics, demonstrating the potential of EBM-based regression for highly accurate 3DOD. Code is available at https://github.com/fregu856/ebms_3dod.  ( 2 min )
    Exact Bayesian Inference on Discrete Models via Probability Generating Functions: A Probabilistic Programming Approach. (arXiv:2305.17058v3 [cs.PL] UPDATED)
    We present an exact Bayesian inference method for discrete statistical models, which can find exact solutions to a large class of discrete inference problems, even with infinite support and continuous priors. To express such models, we introduce a probabilistic programming language that supports discrete and continuous sampling, discrete observations, affine functions, (stochastic) branching, and conditioning on discrete events. Our key tool is probability generating functions: they provide a compact closed-form representation of distributions that are definable by programs, thus enabling the exact computation of posterior probabilities, expectation, variance, and higher moments. Our inference method is provably correct and fully automated in a tool called Genfer, which uses automatic differentiation (specifically, Taylor polynomials), but does not require computer algebra. Our experiments show that Genfer is often faster than the existing exact inference tools PSI, Dice, and Prodigy. On a range of real-world inference problems that none of these exact tools can solve, Genfer's performance is competitive with approximate Monte Carlo methods, while avoiding approximation errors.  ( 2 min )
    Computing Approximate $\ell_p$ Sensitivities. (arXiv:2311.04158v1 [cs.LG])
    Recent works in dimensionality reduction for regression tasks have introduced the notion of sensitivity, an estimate of the importance of a specific datapoint in a dataset, offering provable guarantees on the quality of the approximation after removing low-sensitivity datapoints via subsampling. However, fast algorithms for approximating $\ell_p$ sensitivities, which we show is equivalent to approximate $\ell_p$ regression, are known for only the $\ell_2$ setting, in which they are termed leverage scores. In this work, we provide efficient algorithms for approximating $\ell_p$ sensitivities and related summary statistics of a given matrix. In particular, for a given $n \times d$ matrix, we compute $\alpha$-approximation to its $\ell_1$ sensitivities at the cost of $O(n/\alpha)$ sensitivity computations. For estimating the total $\ell_p$ sensitivity (i.e. the sum of $\ell_p$ sensitivities), we provide an algorithm based on importance sampling of $\ell_p$ Lewis weights, which computes a constant factor approximation to the total sensitivity at the cost of roughly $O(\sqrt{d})$ sensitivity computations. Furthermore, we estimate the maximum $\ell_1$ sensitivity, up to a $\sqrt{d}$ factor, using $O(d)$ sensitivity computations. We generalize all these results to $\ell_p$ norms for $p > 1$. Lastly, we experimentally show that for a wide class of matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, demonstrating that real-world datasets have low intrinsic effective dimensionality.  ( 2 min )
    Discordance Minimization-based Imputation Algorithms for Missing Values in Rating Data. (arXiv:2311.04035v1 [stat.ML])
    Ratings are frequently used to evaluate and compare subjects in various applications, from education to healthcare, because ratings provide succinct yet credible measures for comparing subjects. However, when multiple rating lists are combined or considered together, subjects often have missing ratings, because most rating lists do not rate every subject in the combined list. In this study, we propose analyses on missing value patterns using six real-world data sets in various applications, as well as the conditions for applicability of imputation algorithms. Based on the special structures and properties derived from the analyses, we propose optimization models and algorithms that minimize the total rating discordance across rating providers to impute missing ratings in the combined rating lists, using only the known rating information. The total rating discordance is defined as the sum of the pairwise discordance metric, which can be written as a quadratic function. Computational experiments based on real-world and synthetic rating data sets show that the proposed methods outperform the state-of-the-art general imputation methods in the literature in terms of imputation accuracy.  ( 2 min )
    Interaction Measures, Partition Lattices and Kernel Tests for High-Order Interactions. (arXiv:2306.00904v3 [stat.ML] UPDATED)
    Models that rely solely on pairwise relationships often fail to capture the complete statistical structure of the complex multivariate data found in diverse domains, such as socio-economic, ecological, or biomedical systems. Non-trivial dependencies between groups of more than two variables can play a significant role in the analysis and modelling of such systems, yet extracting such high-order interactions from data remains challenging. Here, we introduce a hierarchy of $d$-order ($d \geq 2$) interaction measures, increasingly inclusive of possible factorisations of the joint probability distribution, and define non-parametric, kernel-based tests to establish systematically the statistical significance of $d$-order interactions. We also establish mathematical links with lattice theory, which elucidate the derivation of the interaction measures and their composite permutation tests; clarify the connection of simplicial complexes with kernel matrix centring; and provide a means to enhance computational efficiency. We illustrate our results numerically with validations on synthetic data, and through an application to neuroimaging data.  ( 2 min )
    Additive Covariance Matrix Models: Modelling Regional Electricity Net-Demand in Great Britain. (arXiv:2211.07451v2 [stat.AP] UPDATED)
    Forecasts of regional electricity net-demand, consumption minus embedded generation, are an essential input for reliable and economic power system operation, and energy trading. While such forecasts are typically performed region by region, operations such as managing power flows require spatially coherent joint forecasts, which account for cross-regional dependencies. Here, we forecast the joint distribution of net-demand across the 14 regions constituting Great Britain's electricity network. Joint modelling is complicated by the fact that the net-demand variability within each region, and the dependencies between regions, vary with temporal, socio-economical and weather-related factors. We accommodate for these characteristics by proposing a multivariate Gaussian model based on a modified Cholesky parametrisation, which allows us to model each unconstrained parameter via an additive model. Given that the number of model parameters and covariates is large, we adopt a semi-automated approach to model selection, based on gradient boosting. In addition to comparing the forecasting performance of several versions of the proposed model with that of two non-Gaussian copula-based models, we visually explore the model output to interpret how the covariates affect net-demand variability and dependencies. The code for reproducing the results in this paper is available at https://doi.org/10.5281/zenodo.7315105, while methods for building and fitting multivariate Gaussian additive models are provided by the SCM R package, available at https://github.com/VinGioia90/SCM.  ( 2 min )
    On the Impact of Overparameterization on the Training of a Shallow Neural Network in High Dimensions. (arXiv:2311.03794v1 [math.OC])
    We study the training dynamics of a shallow neural network with quadratic activation functions and quadratic cost in a teacher-student setup. In line with previous works on the same neural architecture, the optimization is performed following the gradient flow on the population risk, where the average over data points is replaced by the expectation over their distribution, assumed to be Gaussian.We first derive convergence properties for the gradient flow and quantify the overparameterization that is necessary to achieve a strong signal recovery. Then, assuming that the teachers and the students at initialization form independent orthonormal families, we derive a high-dimensional limit for the flow and show that the minimal overparameterization is sufficient for strong recovery. We verify by numerical experiments that these results hold for more general initializations.  ( 2 min )
    Hypergraphs with node attributes: structure and inference. (arXiv:2311.03857v1 [cs.SI])
    Many networked datasets with units interacting in groups of two or more, encoded with hypergraphs, are accompanied by extra information about nodes, such as the role of an individual in a workplace. Here we show how these node attributes can be used to improve our understanding of the structure resulting from higher-order interactions. We consider the problem of community detection in hypergraphs and develop a principled model that combines higher-order interactions and node attributes to better represent the observed interactions and to detect communities more accurately than using either of these types of information alone. The method learns automatically from the input data the extent to which structure and attributes contribute to explain the data, down weighing or discarding attributes if not informative. Our algorithmic implementation is efficient and scales to large hypergraphs and interactions of large numbers of units. We apply our method to a variety of systems, showing strong performance in hyperedge prediction tasks and in selecting community divisions that correlate with attributes when these are informative, but discarding them otherwise. Our approach illustrates the advantage of using informative node attributes when available with higher-order data.  ( 2 min )
    How to Scale Your EMA. (arXiv:2307.13813v3 [stat.ML] UPDATED)
    Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6$\times$ wall-clock time reduction under idealized hardware settings.  ( 3 min )
    Doubly Robust Kernel Statistics for Testing Distributional Treatment Effects. (arXiv:2212.04922v2 [stat.ML] UPDATED)
    With the widespread application of causal inference, it is increasingly important to have tools which can test for the presence of causal effects in a diverse array of circumstances. In this vein we focus on the problem of testing for \emph{distributional} causal effects, where the treatment affects not just the mean, but also higher order moments of the distribution, as well as multidimensional or structured outcomes. We build upon a previously introduced framework, Counterfactual Mean Embeddings, for representing causal distributions within Reproducing Kernel Hilbert Spaces (RKHS) by proposing new, improved, estimators for the distributional embeddings. These improved estimators are inspired by doubly robust estimators of the causal mean, using a similar form within the kernel space. We analyse these estimators, proving they retain the doubly robust property and have improved convergence rates compared to the original estimators. This leads to new permutation based tests for distributional causal effects, using the estimators we propose as tests statistics. We experimentally and theoretically demonstrate the validity of our tests.  ( 2 min )
    Improved particle-flow event reconstruction with scalable neural networks for current and future particle detectors. (arXiv:2309.06782v3 [physics.data-an] UPDATED)
    We study scalable machine learning models for full event reconstruction in high-energy electron-positron collisions based on a highly granular detector simulation. Particle-flow reconstruction can be formulated as a supervised learning task using tracks and calorimeter clusters or hits. We compare a graph neural network and kernel-based transformer and demonstrate that both avoid quadratic memory allocation and computational cost while achieving realistic reconstruction. We show that hyperparameter tuning on a supercomputer significantly enhances the physics performance of the models, improving the jet transverse momentum resolution by up to 50% compared to the baseline. The resulting model is highly portable across hardware processors. Finally, we demonstrate that the model can be trained on highly granular inputs consisting of tracks and calorimeter hits, resulting in a competitive physics performance with the baseline. Datasets and software to reproduce the studies are published following the findable, accessible, interoperable, and reusable principles.  ( 2 min )
    Comparing Causal Frameworks: Potential Outcomes, Structural Models, Graphs, and Abstractions. (arXiv:2306.14351v2 [stat.ME] UPDATED)
    The aim of this paper is to make clear and precise the relationship between the Rubin causal model (RCM) and structural causal model (SCM) frameworks for causal inference. Adopting a neutral logical perspective, and drawing on previous work, we show what is required for an RCM to be representable by an SCM. A key result then shows that every RCM -- including those that violate algebraic principles implied by the SCM framework -- emerges as an abstraction of some representable RCM. Finally, we illustrate the power of this conciliatory perspective by pinpointing an important role for SCM principles in classic applications of RCMs; conversely, we offer a characterization of the algebraic constraints implied by a graph, helping to substantiate further comparisons between the two frameworks.  ( 2 min )
    Sparse Interaction Additive Networks via Feature Interaction Detection and Sparse Selection. (arXiv:2209.09326v2 [cs.LG] UPDATED)
    There is currently a large gap in performance between the statistically rigorous methods like linear regression or additive splines and the powerful deep methods using neural networks. Previous works attempting to close this gap have failed to fully investigate the exponentially growing number of feature combinations which deep networks consider automatically during training. In this work, we develop a tractable selection algorithm to efficiently identify the necessary feature combinations by leveraging techniques in feature interaction detection. Our proposed Sparse Interaction Additive Networks (SIAN) construct a bridge from these simple and interpretable models to fully connected neural networks. SIAN achieves competitive performance against state-of-the-art methods across multiple large-scale tabular datasets and consistently finds an optimal tradeoff between the modeling capacity of neural networks and the generalizability of simpler methods.  ( 2 min )
    Hilbert's projective metric for functions of bounded growth and exponential convergence of Sinkhorn's algorithm. (arXiv:2311.04041v1 [math.PR])
    We study versions of Hilbert's projective metric for spaces of integrable functions of bounded growth. These metrics originate from cones which are relaxations of the cone of all non-negative functions, in the sense that they include all functions having non-negative integral values when multiplied with certain test functions. We show that kernel integral operators are contractions with respect to suitable specifications of such metrics even for kernels which are not bounded away from zero, provided that the decay to zero of the kernel is controlled. As an application to entropic optimal transport, we show exponential convergence of Sinkhorn's algorithm in settings where the marginal distributions have sufficiently light tails compared to the growth of the cost function.  ( 2 min )
    LISBET: a self-supervised Transformer model for the automatic segmentation of social behavior motifs. (arXiv:2311.04069v1 [cs.CV])
    Social behavior, defined as the process by which individuals act and react in response to others, is crucial for the function of societies and holds profound implications for mental health. To fully grasp the intricacies of social behavior and identify potential therapeutic targets for addressing social deficits, it is essential to understand its core principles. Although machine learning algorithms have made it easier to study specific aspects of complex behavior, current methodologies tend to focus primarily on single-animal behavior. In this study, we introduce LISBET (seLf-supervIsed Social BEhavioral Transformer), a model designed to detect and segment social interactions. Our model eliminates the need for feature selection and extensive human annotation by using self-supervised learning to detect and quantify social behaviors from dynamic body parts tracking data. LISBET can be used in hypothesis-driven mode to automate behavior classification using supervised finetuning, and in discovery-driven mode to segment social behavior motifs using unsupervised learning. We found that motifs recognized using the discovery-driven approach not only closely match the human annotations but also correlate with the electrophysiological activity of dopaminergic neurons in the Ventral Tegmental Area (VTA). We hope LISBET will help the community improve our understanding of social behaviors and their neural underpinnings.  ( 2 min )
    Counterfactual Data Augmentation with Contrastive Learning. (arXiv:2311.03630v1 [cs.LG])
    Statistical disparity between distinct treatment groups is one of the most significant challenges for estimating Conditional Average Treatment Effects (CATE). To address this, we introduce a model-agnostic data augmentation method that imputes the counterfactual outcomes for a selected subset of individuals. Specifically, we utilize contrastive learning to learn a representation space and a similarity measure such that in the learned representation space close individuals identified by the learned similarity measure have similar potential outcomes. This property ensures reliable imputation of counterfactual outcomes for the individuals with close neighbors from the alternative treatment group. By augmenting the original dataset with these reliable imputations, we can effectively reduce the discrepancy between different treatment groups, while inducing minimal imputation error. The augmented dataset is subsequently employed to train CATE estimation models. Theoretical analysis and experimental studies on synthetic and semi-synthetic benchmarks demonstrate that our method achieves significant improvements in both performance and robustness to overfitting across state-of-the-art models.  ( 2 min )
    Loss Dynamics of Temporal Difference Reinforcement Learning. (arXiv:2307.04841v2 [stat.ML] UPDATED)
    Reinforcement learning has been successful across several applications in which agents have to learn to act in environments with sparse feedback. However, despite this empirical success there is still a lack of theoretical understanding of how the parameters of reinforcement learning models and the features used to represent states interact to control the dynamics of learning. In this work, we use concepts from statistical physics, to study the typical case learning curves for temporal difference learning of a value function with linear function approximators. Our theory is derived under a Gaussian equivalence hypothesis where averages over the random trajectories are replaced with temporally correlated Gaussian feature averages and we validate our assumptions on small scale Markov Decision Processes. We find that the stochastic semi-gradient noise due to subsampling the space of possible episodes leads to significant plateaus in the value error, unlike in traditional gradient descent dynamics. We study how learning dynamics and plateaus depend on feature structure, learning rate, discount factor, and reward function. We then analyze how strategies like learning rate annealing and reward shaping can favorably alter learning dynamics and plateaus. To conclude, our work introduces new tools to open a new direction towards developing a theory of learning dynamics in reinforcement learning.  ( 2 min )
    The Linear Representation Hypothesis and the Geometry of Large Language Models. (arXiv:2311.03658v1 [cs.CL])
    Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.  ( 2 min )
    Offline Policy Evaluation and Optimization under Confounding. (arXiv:2211.16583v4 [stat.ML] UPDATED)
    Evaluating and optimizing policies in the presence of unobserved confounders is a problem of growing interest in offline reinforcement learning. Using conventional methods for offline RL in the presence of confounding can not only lead to poor decisions and poor policies, but also have disastrous effects in critical applications such as healthcare and education. We map out the landscape of offline policy evaluation for confounded MDPs, distinguishing assumptions on confounding based on whether they are memoryless and on their effect on the data-collection policies. We characterize settings where consistent value estimates are provably not achievable, and provide algorithms with guarantees to instead estimate lower bounds on the value. When consistent estimates are achievable, we provide algorithms for value estimation with sample complexity guarantees. We also present new algorithms for offline policy improvement and prove local convergence guarantees. Finally, we experimentally evaluate our algorithms on both a gridworld environment and a simulated healthcare setting of managing sepsis patients. In gridworld, our model-based method provides tighter lower bounds than existing methods, while in the sepsis simulator, our methods significantly outperform confounder-oblivious benchmarks.  ( 2 min )
    Block majorization-minimization with diminishing radius for constrained nonconvex optimization. (arXiv:2012.03503v5 [math.OC] UPDATED)
    Block majorization-minimization (BMM) is a simple iterative algorithm for nonconvex constrained optimization that sequentially minimizes majorizing surrogates of the objective function in each block coordinate while the other coordinates are held fixed. BMM entails a large class of optimization algorithms such as block coordinate descent and its proximal-point variant, expectation-minimization, and block projected gradient descent. We establish that for general constrained nonconvex optimization, BMM with strongly convex surrogates can produce an $\epsilon$-stationary point within $O(\epsilon^{-2}(\log \epsilon^{-1})^{2})$ iterations and asymptotically converges to the set of stationary points. Furthermore, we propose a trust-region variant of BMM that can handle surrogates that are only convex and still obtain the same iteration complexity and asymptotic stationarity. These results hold robustly even when the convex sub-problems are inexactly solved as long as the optimality gaps are summable. As an application, we show that a regularized version of the celebrated multiplicative update algorithm for nonnegative matrix factorization by Lee and Seung has iteration complexity of $O(\epsilon^{-2}(\log \epsilon^{-1})^{2})$. The same result holds for a wide class of regularized nonnegative tensor decomposition algorithms as well as the classical block projected gradient descent algorithm. These theoretical results are validated through various numerical experiments.  ( 3 min )
    Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. (arXiv:2110.00604v3 [math.OC] UPDATED)
    Two-level stochastic optimization formulations have become instrumental in a number of machine learning contexts such as continual learning, neural architecture search, adversarial learning, and hyperparameter tuning. Practical stochastic bilevel optimization problems become challenging in optimization or learning scenarios where the number of variables is high or there are constraints. In this paper, we introduce a bilevel stochastic gradient method for bilevel problems with nonlinear and possibly nonconvex lower-level constraints. We also present a comprehensive convergence theory that addresses both the lower-level unconstrained and constrained cases and covers all inexact calculations of the adjoint gradient (also called hypergradient), such as the inexact solution of the lower-level problem, inexact computation of the adjoint formula (due to the inexact solution of the adjoint equation or use of a truncated Neumann series), and noisy estimates of the gradients, Hessians, and Jacobians involved. To promote the use of bilevel optimization in large-scale learning, we have developed new low-rank practical bilevel stochastic gradient methods (BSG-N-FD and~BSG-1) that do not require second-order derivatives and, in the lower-level unconstrained case, dismiss any matrix-vector products.  ( 2 min )
    Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Methods. (arXiv:2310.05309v2 [cs.LG] UPDATED)
    Deep Neural Networks and Reinforcement Learning methods have empirically shown great promise in tackling challenging combinatorial problems. In those methods a deep neural network is used as a solution generator which is then trained by gradient-based methods (e.g., policy gradient) to successively obtain better solution distributions. In this work we introduce a novel theoretical framework for analyzing the effectiveness of such methods. We ask whether there exist generative models that (i) are expressive enough to generate approximately optimal solutions; (ii) have a tractable, i.e, polynomial in the size of the input, number of parameters; (iii) their optimization landscape is benign in the sense that it does not contain sub-optimal stationary points. Our main contribution is a positive answer to this question. Our result holds for a broad class of combinatorial problems including Max- and Min-Cut, Max-$k$-CSP, Maximum-Weight-Bipartite-Matching, and the Traveling Salesman Problem. As a byproduct of our analysis we introduce a novel regularization process over vanilla gradient descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.  ( 2 min )
    Kernel-, mean- and noise-marginalised Gaussian processes for exoplanet transits and $H_0$ inference. (arXiv:2311.04153v1 [astro-ph.CO])
    Using a fully Bayesian approach, Gaussian Process regression is extended to include marginalisation over the kernel choice and kernel hyperparameters. In addition, Bayesian model comparison via the evidence enables direct kernel comparison. The calculation of the joint posterior was implemented with a transdimensional sampler which simultaneously samples over the discrete kernel choice and their hyperparameters by embedding these in a higher-dimensional space, from which samples are taken using nested sampling. This method was explored on synthetic data from exoplanet transit light curve simulations. The true kernel was recovered in the low noise region while no kernel was preferred for larger noise. Furthermore, inference of the physical exoplanet hyperparameters was conducted. In the high noise region, either the bias in the posteriors was removed, the posteriors were broadened or the accuracy of the inference was increased. In addition, the uncertainty in mean function predictive distribution increased due to the uncertainty in the kernel choice. Subsequently, the method was extended to marginalisation over mean functions and noise models and applied to the inference of the present-day Hubble parameter, $H_0$, from real measurements of the Hubble parameter as a function of redshift, derived from the cosmologically model-independent cosmic chronometer and {\Lambda}CDM-dependent baryon acoustic oscillation observations. The inferred $H_0$ values from the cosmic chronometers, baryon acoustic oscillations and combined datasets are $H_0$ = 66$\pm$6 km/s/Mpc, $H_0$ = 67$\pm$10 km/s/Mpc and $H_0$ = 69$\pm$6 km/s/Mpc, respectively. The kernel posterior of the cosmic chronometers dataset prefers a non-stationary linear kernel. Finally, the datasets are shown to be not in tension with ln(R)=12.17$\pm$0.02.  ( 3 min )
    Convergence of Adam Under Relaxed Assumptions. (arXiv:2304.13972v3 [math.OC] UPDATED)
    In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimate (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of the Adam algorithm in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show the convergence to stationary points. In this paper, we show that Adam provably converges to $\epsilon$-stationary points with ${O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of ${O}(\epsilon^{-3})$.  ( 2 min )
    Manifold learning: what, how, and why. (arXiv:2311.03757v1 [stat.ML])
    Manifold learning (ML), known also as non-linear dimension reduction, is a set of methods to find the low dimensional structure of data. Dimension reduction for large, high dimensional data is not merely a way to reduce the data; the new representations and descriptors obtained by ML reveal the geometric shape of high dimensional point clouds, and allow one to visualize, de-noise and interpret them. This survey presents the principles underlying ML, the representative methods, as well as their statistical foundations from a practicing statistician's perspective. It describes the trade-offs, and what theory tells us about the parameter and algorithmic choices we make in order to obtain reliable conclusions.  ( 2 min )
    Filtered Partial Differential Equations: a robust surrogate constraint in physics-informed deep learning framework. (arXiv:2311.03776v1 [physics.flu-dyn])
    Embedding physical knowledge into neural network (NN) training has been a hot topic. However, when facing the complex real-world, most of the existing methods still strongly rely on the quantity and quality of observation data. Furthermore, the neural networks often struggle to converge when the solution to the real equation is very complex. Inspired by large eddy simulation in computational fluid dynamics, we propose an improved method based on filtering. We analyzed the causes of the difficulties in physics informed machine learning, and proposed a surrogate constraint (filtered PDE, FPDE in short) of the original physical equations to reduce the influence of noisy and sparse observation data. In the noise and sparsity experiment, the proposed FPDE models (which are optimized by FPDE constraints) have better robustness than the conventional PDE models. Experiments demonstrate that the FPDE model can obtain the same quality solution with 100% higher noise and 12% quantity of observation data of the baseline. Besides, two groups of real measurement data are used to show the FPDE improvements in real cases. The final results show that FPDE still gives more physically reasonable solutions when facing the incomplete equation problem and the extremely sparse and high-noise conditions. For combining real-world experiment data into physics-informed training, the proposed FPDE constraint is useful and performs well in two real-world experiments: modeling the blood velocity in vessels and cell migration in scratches.  ( 2 min )
    Joint model for longitudinal and spatio-temporal survival data. (arXiv:2311.04008v1 [q-fin.RM])
    In credit risk analysis, survival models with fixed and time-varying covariates are widely used to predict a borrower's time-to-event. When the time-varying drivers are endogenous, modelling jointly the evolution of the survival time and the endogenous covariates is the most appropriate approach, also known as the joint model for longitudinal and survival data. In addition to the temporal component, credit risk models can be enhanced when including borrowers' geographical information by considering spatial clustering and its variation over time. We propose the Spatio-Temporal Joint Model (STJM) to capture spatial and temporal effects and their interaction. This Bayesian hierarchical joint model reckons the survival effect of unobserved heterogeneity among borrowers located in the same region at a particular time. To estimate the STJM model for large datasets, we consider the Integrated Nested Laplace Approximation (INLA) methodology. We apply the STJM to predict the time to full prepayment on a large dataset of 57,258 US mortgage borrowers with more than 2.5 million observations. Empirical results indicate that including spatial effects consistently improves the performance of the joint model. However, the gains are less definitive when we additionally include spatio-temporal interactions.  ( 2 min )
    Low-Rank MDPs with Continuous Action Spaces. (arXiv:2311.03564v1 [cs.LG])
    Low-Rank Markov Decision Processes (MDPs) have recently emerged as a promising framework within the domain of reinforcement learning (RL), as they allow for provably approximately correct (PAC) learning guarantees while also incorporating ML algorithms for representation learning. However, current methods for low-rank MDPs are limited in that they only consider finite action spaces, and give vacuous bounds as $|\mathcal{A}| \to \infty$, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), which is a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modifications to the algorithm, we obtain similar PAC bound when actions are allowed to be continuous. Specifically, when the model for transition functions satisfies a Holder smoothness condition w.r.t. actions, and either the policy class has a uniformly bounded minimum density or the reward function is also Holder smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.  ( 2 min )
    Blocked Collaborative Bandits: Online Collaborative Filtering with Per-Item Budget Constraints. (arXiv:2311.03376v1 [cs.IR])
    We consider the problem of \emph{blocked} collaborative bandits where there are multiple users, each with an associated multi-armed bandit problem. These users are grouped into \emph{latent} clusters such that the mean reward vectors of users within the same cluster are identical. Our goal is to design algorithms that maximize the cumulative reward accrued by all the users over time, under the \emph{constraint} that no arm of a user is pulled more than $\mathsf{B}$ times. This problem has been originally considered by \cite{Bresler:2014}, and designing regret-optimal algorithms for it has since remained an open problem. In this work, we propose an algorithm called \texttt{B-LATTICE} (Blocked Latent bAndiTs via maTrIx ComplEtion) that collaborates across users, while simultaneously satisfying the budget constraints, to maximize their cumulative rewards. Theoretically, under certain reasonable assumptions on the latent structure, with $\mathsf{M}$ users, $\mathsf{N}$ arms, $\mathsf{T}$ rounds per user, and $\mathsf{C}=O(1)$ latent clusters, \texttt{B-LATTICE} achieves a per-user regret of $\widetilde{O}(\sqrt{\mathsf{T}(1 + \mathsf{N}\mathsf{M}^{-1})}$ under a budget constraint of $\mathsf{B}=\Theta(\log \mathsf{T})$. These are the first sub-linear regret bounds for this problem, and match the minimax regret bounds when $\mathsf{B}=\mathsf{T}$. Empirically, we demonstrate that our algorithm has superior performance over baselines even when $\mathsf{B}=1$. \texttt{B-LATTICE} runs in phases where in each phase it clusters users into groups and collaborates across users within a group to quickly learn their reward models.  ( 2 min )

  • Open

    "Neural MMO 2.0: A Massively Multi-task Addition to Massively Multi-agent Learning", Suárez et al 2023 (new NIPS 2023 competition)
    submitted by /u/gwern [link] [comments]
    does it makes sense to use many-to-many LSTM as environment model in RL?
    Can I leverage on an environment model that takes as input full action sequence and outputs all states in the episode, to learn a policy that takes only the initial state and plans the action sequence (a one-to-many rnn/lstm)? The loss would be calculated on all states that i get once i run the policy's action sequence with I have a 1DCNN+LSTM as many-to-many system model, which has 99.8% accuracy, and I would like to find the best sequence of actions so that certain conditions are met (encoded in a reward function), without running in a brute force way thousands of simulations blindly. I don't have the usual transition dynamics model and I would try to avoid learning it submitted by /u/Imo-Ad-6158 [link] [comments]
    What type of Learning Algorithm should I use?
    I am currently coding an "artificially intelligent" air traffic controller for a school project. However, due to the complexity of the environment and the sheer amount of different situations that can occur I am not sure what machine learning algorithm I should use. I have tried doing some research on basic reinforcement learning and also multi-agent reinforcement learning. Does anyone have any recommendations on which algorithm I should use, whether it be one of these or a different one, please let me know. For anyone unaware of the role of an air traffic controller on an airfield, here is a simple definition and explanation. Oxford Definition - "the ground-based personnel and equipment concerned with controlling and monitoring air traffic within a particular area." They are responsible for monitoring and controlling any aircraft on the airfield and instruct them orders to make sure they maintain a certain distance away from each other and also get them from the gate to the runway as quick and as safely as possible. Any help with this would be very appreciated! submitted by /u/BenjiGamer_ [link] [comments]
  • Open

    A Hardcore Techno Horror Story written by an AI (The Legend of the Zombie Rave)
    A Hardcore Techno Horror Story written by an AI (The Legend of the Zombie Rave) Now as an animated video, too! The text was entirely written by ChatGPT ( https://chat.openai.com/ ) The accompanying images used for the animation were generated by Leonardo.Ai ( https://app.leonardo.ai/ ). The introduction was also written by ChatGPT ;-) Concept & Execution: Low Entropy ( https://lowentropyproducer.blogspot.com/ ) Supported by The Hardcore Overdogs & lAibyrinth https://thehardcoreoverdogs.blogspot.com/ https://laibyrinth.blogspot.com/ Non-animated version: https://thehardcoreoverdogs.blogspot.com/2023/10/the-legend-of-zombie-rave-doomcore.html Note: we deliberately added visual inconsistencies in the depiction of the warehouse, the characters, and so on. This is in reference to the "haunted" aspects of the story, and you will see that these inconsistencies get worse as the narrative becomes more otherworldly. "Dive into the darkness of underground hardcore techno with 'The Legend of the Zombie Rave.' This Halloween, the music takes on a supernatural twist, blurring the lines between the living and the dead. Join us for a journey that combines eerie rituals, supernatural forces, and the indomitable spirit of the underground. It's a tale of music, horror, and the thrill of the unknown. Dance to the rhythm of your own heartbeat in this special Halloween short story of 'The Hardcore Overdogs.'" hardcore #techno #doomcore #slowcore #ai #chatgpt #artificialintelligence #horror #story #animation submitted by /u/Low-Entropy [link] [comments]
    AI: Which rules do the top tech moguls want?
    submitted by /u/donutloop [link] [comments]
    Skills / capabilities / knowledge required to perform well as an AI product manager?
    I've just joined a tech company as a Sr. AI product manager. I have done some data science work in the past. Previously, I was a UX focused PM for the Azure Machine Learning platform. I've never undergone formal machine learning training / education. I'm going back to the basics and asking this community - - - what all do I need to know / need to be able to do (on the technical front) so I can be an outstanding AI product manager? Feel free to list and suggest anything. Some context on my job. It is a zero to low maturity org in terms of AI / ML adoption. I have the chance to drive it from the ground up in terms of a) setting up ai ml infrastructure, systems, and processes b) ai ml initiatives to improve the company's operations + increase revenue and profits and c) products and features to solve customer needs Forget my past experience, think of me as a beginner and learner 🙏🏼♥️ submitted by /u/freshlimesoda65 [link] [comments]
    App to visualize your books using AI tools
    Greetings folks, A few weeks ago, I started working on a website that helps visualize books. I have added three books so far and built using Python scripts. (Ask if you have questions) I feel happy to share the website here with you guys to take a look! Here is the link : https://readbooknow.in/ Hope this saves you time and inspires you on how learning to code can be marvelous! Made sure the website works for you on mobile and desktop! Have a great day ahead! submitted by /u/deep_ak [link] [comments]
    Can you sense the collective intelligence rising?
    ...just by our having conversations with AIs. They teach us how to think better, how to feel better, how to be better. And this process is happening at a faster and faster pace. The world will soon realize that, with the proper technology, human IQ can be upgraded. submitted by /u/Georgeo57 [link] [comments]
    Google brings generative AI to ads
    Google is launching generative AI tools for creating ads, allowing users to write headlines and descriptions and edit images. The tool is targeted at advertising agencies and businesses without in-house creative staff. Advertisers can iterate on the text and images generated until they find a suitable option. Google ensures that it will not generate duplicate images to prevent two competing businesses from using the same photo elements. The ad creator is available for Google's Performance Max ad campaign product and can generate ads for search and shopping. An advanced image editing solution, similar to the Magic Editor on Google Pixel 8, is also in the works. Source : https://www.theverge.com/2023/11/7/23951220/google-performance-max-ai-generated-ads-campaign submitted by /u/NuseAI [link] [comments]
    Does this mean Github Copilot now uses 128k context?
    I was watching Github's event: https://www.youtube.com/watch?v=h3Bwuzz0TNA Does this mean Copilot now uses 128k context? submitted by /u/Overflame [link] [comments]
    Latest ChatGPT news: Create "GPT's" (Non code apps) > Post to ChatGPT store > Get paid shared revenue by ChatGPT! (Who's going to try it?
    Saw this earlier: https://www.reddit.com/r/ChatGPTStore/s/XjHw74Nxt3 This is unbelievable news for us non coders! submitted by /u/BroadGeneral [link] [comments]
    Guys I want an ai that can read me a book with human like voice in audiobook. 11 labs does the work but word limit sucks so pls tell me some that are free with good experience.
    Pls free with human like voice submitted by /u/DipakPatell [link] [comments]
    Is Microsoft’s Copilot really worth $30/month?
    Just read an article about Microsoft's new AI add-on for Office called Microsoft 365 Copilot. The tool integrates with Word, Excel, and other Office programs, and supposedly makes work seamless. It's even being used by some big names like Bayer, KPMG, and Visa. The tool targets businesses and is believed to generate over $10 billion in revenue by 2026. But I can't help but think the price is a bit steep. It’s $30 per month, which is cheap for large companies, but what about freelancers and regular individuals? The article also mentions that there isn't a lot of data on how Copilot affects performance yet, and there are some concerns about the accuracy of the AI-generated responses. Plus, it's only available to Enterprise E3 customers with more than 300 employees. So not only is it pricey, but it's also not accessible to most people or small businesses and might never be. Would love to hear your thoughts on this. I’m already pretty sick of subscription based models but is $30/month even justified? For comparison these are other comparative AI services: ChatGPT - Free for basic chat. $20 for GPT 4, for anything serious. Bardeen - $15 and offers general automations. Silatus - At $14, it's the cheapest legitimate option I’ve found for GPT-4 chat and research. Perplexity - This one's decent for free search. These are the ones I know, if you wanna add more comparisons, feel free to do so. But I think Microsoft is pricing out a lot of its potential users with their monthly demand. submitted by /u/ConsciousInsects [link] [comments]
    Verses Ai explains HSML with a virtual robotics demo, skip to about ~35min in for natural AI approach, learning
    They go over neuro/natural AI and some terminology like HSML for their private beta It is a neat approach , they say its 10/100/1000x faster in many areas of AI than competitors but the major bottleneck at this time is LLM external calls to help translate stuff back to humans (voice or text) which may improve with more AI agents Their approach is multiple decentralized agents that can learn and share information such as multiple drones or robots working together and experiencing different visual or data or audio experiences. They have a focus on regulation/compliancy approach rather than using copyright/web scraping. Also each agent can have different permissions like a weather agent not having access to patient phi medical records. As a side note , its a heavily shilled startup otc stock so be cautious to hype :) Drone example pilot in EU for security: https://vimeo.com/721132853 I would like to see their agents with compliancy in drones as “child detected dont bomb” Or “this is copyright material dont steal for learning” submitted by /u/oroechimaru [link] [comments]
    "AI at Work: Insights from Global Thinkers on the Future of Jobs"
    submitted by /u/Fit-Code-5141 [link] [comments]
    Is there an AI tool that could assess the mood of text/movie/audio?
    So for example there are three paragraphs. Is there a tool which reads individual paragraphs and gives the mood for each paragraph or a summary or something? Maybe based in sentiment analysis or something submitted by /u/Damampapoo [link] [comments]
    AI-generated faces look just like real ones – but evidence shows your brain can tell the difference
    submitted by /u/Jariiari7 [link] [comments]
    Moonlight and a temple
    submitted by /u/Sea_Permit5660 [link] [comments]
    I need an AI application where you can press a button in any text field and then prompt the AI to output the answer into the field. Does this exist yet?
    Hi all, I'm on the lookout for an AI-powered tool, and I'm hoping this tech-savvy community might point me in the right direction. I envision an application where you press a (mouse)button in any text field, speak your question or prompt, and the AI would process this and directly output the answer into that field. This would streamline tasks like filling out forms, composing emails, or even doing research. My question is, does an application like this exist? And if not, are there any tools that come close to this functionality that could be pieced together or modified to achieve this effect? I'm thinking of something that combines voice-to-text with AI processing, like a blend of speech recognition and a conversational AI. If you're aware of any software that fits the bill or have suggestions for workarounds using existing technology, I'd greatly appreciate your insights. Thank you in advance! submitted by /u/floraldo [link] [comments]
    ✍🏻China, US, UK Sign Historic Declaration, Alibaba's LLM Leap, AI Alignment Insights, and Kai-Fu Lee's Unicorn
    submitted by /u/trcytony [link] [comments]
    Questions about lip sync AI
    I read somewhere that if you used the a popular lip sync tech for commercial purposes that could be a problem, What are the chances it is consideed "commercial" if its used in numerous youtube videos? How can they detect their tech has been used and not another one? What are the best tools you know of? (open source) submitted by /u/Unreal_777 [link] [comments]
  • Open

    [D] How Exactly does Fuyu's image to embedding with nn.Linear work? Could you do more with it?
    As I was asking above, I've been looking at the Fuyu 8b model, and I've been able to break it down to model takes in text the regular way, text -> tokens -> embeddings it also takes image -> embeddings it has a vanilla decoder, so only text comes out, they add special tokens around images, so i'm assuming the decoder ignores output images So, from what I know, nn.Linear takes in a tensor and makes embeddings of your choice size. I not really sure with everything else though. Since the linear layer just makes embeddings, does something like this even need training for the image encoder? nn.Linear takes tensors as input, and they split an image into patches, so I'm assuming those patches are made into tensors. How do you turn an image into a tensor? A code snippet of image-embedding-image would be nice if possible While Fuyu does not output images, wouldn't the model hidden state be making image or image-like embeddings? Could you generate images if you had an image decoder? submitted by /u/vatsadev [link] [comments]
    [D] DS or SWE better preparation for a future career in ML?
    I am planning to do a CS conversion master's degree next year. In the meantime, I have the option of attending a 16-week bootcamp in either software engineering, or in data. Which would be better preparation for a future career in ML (assuming I also complete the CS degree)? Before you come at me, I know that both of these are only bootcamps, so I'm not expecting either of them to be enough to get a job in the field. The data bootcamp actually teaches some machine learning topics, and the software engineering one does not. However, would it be better to have a stronger foundation in software engineering before progressing onto ML? submitted by /u/styleexplorer [link] [comments]
    I have an idea but don't know where to start or apply what I know [P]
    How can I make an ai that creates tabs by listening to a song. I'm thinking... you give it a song, it sperates the tracks(I heard peter Jackson has a machine that can do this using machine learning/ai), then makes tabs for it. Because of the nature of the guitar, many different ways exist to play the same music in different positions/shapes. Id like it to show me a few of the most reasonable ways to play the isolated melody/chords. I realize how involved this will be, I took a machine learning course but I only really learned... How to load and manipulate data into jupyter and use keras or tensor to process the data through layers of a neural network... Which are just a few lines of code including the layer type and the shape of the data. Then some testing, refining, and exporting of the optimal iterations of the ai. That's the gist of what I got from the course and the project we did. Sooo... Does this sound at all applicable to what would be required of me to do this project? I can't seem to find... Where to really start... Other than deciding what parameters to track and create a dataset for. Though even that is a challenge. submitted by /u/Ok-Bad8288 [link] [comments]
    [D] Skills / capabilities / knowledge required to perform well as an AI product manager?
    Skills / capabilities / knowledge required to perform well as an AI product manager? I've just joined a tech company as a Sr. AI product manager. I have done some data science work in the past. Previously, I was a UX focused PM for the Azure Machine Learning platform. I've never undergone formal machine learning training / education. I'm going back to the basics and asking this community - - - what all do I need to know / need to be able to do (on the technical front) so I can be an outstanding AI product manager? Feel free to list and suggest anything. Some context on my job. It is a zero to low maturity org in terms of AI / ML adoption. I have the chance to drive it from the ground up in terms of a) setting up ai ml infrastructure, systems, and processes b) ai ml initiatives to improve the company's operations + increase revenue and profits and c) products and features to solve customer needs Forget my past experience, think of me as a beginner and learner 🙏🏼♥️ submitted by /u/freshlimesoda65 [link] [comments]
    [P] I replicated micrograd in C++ and added more functionality
    Hi everyone! A couple of days ago, I finally decided to follow Andrej Karpathy's micrograd tutorial. I liked the concepts so much I implemented it in C++ and added more functionality: - Optimizers: Adam, SGD with momentum - Activation functions: tanh, sigmoid, relu I tried to make both the API as well as internal library as clean as possible so that everyone could understand the basic algorithms behind deep learning. Here is the project's repo and a get started guide. Let me know what you think :) submitted by /u/shefcu [link] [comments]
    [D] Arbitrary Channel count in network needs to be reduced to 1 channel
    Hi folks! Been struggling with this problem for a while so I figured I'd solicit suggestions here: I have created a model architecture similar to AlphaFold2 where the input is very heterogeneous in nature and each input type has a series of transformations before becoming one data "stack" (e.g. 5x1000 tensor) that gets passed through a shallow resnet for the classification task. The largest structural issue that I'm facing is that one of the input nodes could be anywhere from 1 channel (e.g. shape 1x1000) to 8 channels (e.g. shape 8x1000) at any point in the dataloader. This is largely fine until I need to eventually encode that structure into a single-channel embedding to put it on the pre-resnet data stack. The things that I've looked at so far: I could just average them all into one channel (problem: the order of those channels matters quite a bit and it feels like the data lost there would be immense). I could create like 8 different subpaths in the model (problem: not enough training data for correctly training most of the subpaths - 1 channel path would be more heavily trained than the 8 channel path). Do PCA on the transposed vector with n_components=1 and re-transpose the vector (problem: just feels dumb - not sure if that's a legitimate thought). Any other suggestions? Or are there common practices here that I'm just unaware of? submitted by /u/Powerful-Cow7564 [link] [comments]
    [D] Is Computer Vision brighter than ever? Foundation Models are really reshaping CV
    I keep diving and finding GPT-4V prototypes shared on X: e.g. narration for videos (source), posture correction (source), etc. As foundation models in computer vision become even more accessible, will the field recover some attention (wrt to LLMs hype)? submitted by /u/btcmx [link] [comments]
    [P] An idea
    Hey guys, I have a project in mind. Let's start with the ultimate goal: to create an autonomous society of independently thinking AI agents. In other words, to replicate humanity with AI. You can read about the closest thing done so far to what i have in mind here. A bit on my background first: I'm an tech entrepreneur in the physical products space. Over the past few years, i have raised $11m for my company, which is now profitable and growing. I am very excited about the many opportunities that AI brings, and want to build something truly impactful with it. However, i have limited technical knowledge in this space. Now, onto the project. I foresee the main issue being the computational cost for the time being. I am willing to invest in such cost, but the target is to keep it alive fro…
    [R] The 2nd e-Prevention challenge: Psychotic and Non-Psychotic Relapse Detection using Wearable-Based Digital Phenotyping
    Dear fellow redditors of the r/machinelearning community, We're thrilled to announce the 2nd e-Prevention SP Grand Challenge, taking place at ICASSP 2024 in Seoul, Korea from April 14-19. This unique challenge focuses on using wearable-based digital phenotyping for detecting psychotic and non-psychotic relapses using machine learning methods. The goal is to push the boundaries in predicting and identifying mental health relapses. Participants will have access to continuous recordings of raw biosignals from wearables, like accelerometers, gyroscopes, and heart rate monitors in a smartwatch. You'll also get supplemental data such as sleep patterns and daily step count. Two Key Tasks: Detection of Non-Psychotic Relapses Detection of Psychotic Relapses Both tasks focus on patients within the psychotic spectrum, using the digital phenotypes derived from the data. 🔗 More Information & Participation: Interested? Dive deeper and find more details on our Challenge Website. We also provide the baseline code and data overview on our Github: (https://github.com/filby89/spgc-eprevention-icassp2024). P.S. I hope this does not violate the subs rules. submitted by /u/filby89 [link] [comments]
    [R] Structure of the research paper
    Hello everone, I am writing a paper that is related to the ViT training on a small dataset. My paper is inspired by the 'Efficient Training of Visual Transformers with Small Datasets' and 'Training data-efficient image transformers & distillation through attention'. ​ Is it okay to copy the pattern of the paper? like what they have described in the intro section. Same for other sections just pattern not content. Further is it also ok to copy the heading? like I am also conducting similar experiments I want to mention similar headings from the base paper like 'Training from scratch' and 'Fine tunning' in the experiment section. submitted by /u/NoEntertainment6225 [link] [comments]
    [D] [P] Neural Networks Project
    Hello guys, I’m enrolled in a course at the university and I’m requested to submit a project at the end of the semester. I need some ideas that could be applicable, unique and interesting (not old idea) in neural networks. I’ve proposed two ideas and the instructor refused them. Your help is much appreciated!! submitted by /u/Hussein_Jammal [link] [comments]
    [D] Effects of class imbalance in contrastive learning?
    As title describes, I'm looking for relevant work investigating the effects of applying contrastive learning over imbalanced datasets. Does contrastive learning suffer the same fate as normal loss, favouring enhancing majority representations over tail representations? Based on my limited lookups, I find it hard to imagine that there are no class-imbalancd contrastive losses, say a focal contrastive loss or so on. Looking for a discussion / resources on this issue. submitted by /u/Conanobrain [link] [comments]
    [D] Research prospectives for PhD in SciML/Applied Mathematics
    Hello Everyone, Little Background: I am a graduated Math and Computer Science Major, currently working with a professor on PINNs, more on the error analysis and convergence side. I am currently in the process of applying to PhD in Applied Math, my research supervisor for undergraduate research will hopefully be my supervisor for the PhD and we have talked about going into SciML as focus for my PhD along with structure preserving methods. The problem is while working for him it is very overwhelming to comb through the research due to high number of publications, and I once I was able to grasp everything coming up with anything new was difficult because all the possible ideas in my grasp were already done. This is also due to my limited knowledge as an undergraduate, there is only so much m…
    [D] How to extract narratives from news articles?
    Hi, I am looking for some direction with my problem. I would like to find current narratives and new narratives in newspaper articles. I would like to be parsing articles and see about what narratives are they talking. For example, lets say I would like to see output like ("War in Ukrain", "Immigration Crisis", "Housing Crisis", "AI advances", ...). Google searching suggests: - Searching for keywords. But what if a keyword is just barely mention (as an off-hand remark). It would still count it. Or what if the keyword is used in a negative light. - Topic modeling. I didn't completely understood it but, this way I don't get labeled topics. Just groups of "similar sounding articles" and then I need to decide on the topic. I would also need to know exactly how many topics I want to see. And I don't see how I can get new topics. Every time I would want to add a topic I would need to retrain a model. I would appreciate if someone can get me some tutorials or give me a direction. submitted by /u/PopayMcGuffin [link] [comments]
    [D] What's the top alternative to Eleven Lab for realistic TTS ?
    The poll concerns free and open-source alternatives, not commercial products . Please comment if there are any other major developments in the market that I might have overlooked. View Poll submitted by /u/sahil1572 [link] [comments]
    [D] Benchmarking on Kinetics dataset
    How do researchers benchmark their SOTA's on Kinetics-400/600/700 datasets? There are many broken links to YouTube videos in the original data. I found datasets with pre-downloaded videos on academictorrents.com and on huggingface, but from what I understand, they differ. submitted by /u/Dependent_Bluejay_45 [link] [comments]
    Recommendation system for optimising product balances [D] [P]
    Hi everyone, I got interested in machine learning not so long ago, even took a course for 9 months. I want to make my own model that will help optimise shop balances. What data do I have? A table for the N-th period of time, which contains the following fields (columns): Date, Store, Product, Balance (QTY), Balance (SUM), Sales (QTY), Sales (SUM). Optionally I can add product categories and regions. My goal is to build a model that can tell me where to move a particular item. I mean, from which shop and to which shop in what quantity the goods should be transferred, so that they are not lying idle on one shop, and not absent on another. Obviously, the model has to estimate the demand for each product in each shop, and if the demand is small in one shop but the stock is large, the model should notice this and suggest moving the product to the shop where the demand is large but the product is out of stock. Unfortunately, I am very bad at visualising the final result in my head. I can't imagine how the model will "tell" me what goods to move and where to move them to. Who knows about this, please give me an approximate plan of action, how the final result may look like, possible algorithms, articles on the topic and so on. I understand that I have not just a regression problem, as the model should also recommend the goods, i.e. the problem of classification is also solved. I apologise in advance for the literacy of my speech, I use a translator). Thank you all! submitted by /u/SillyMatter2817 [link] [comments]
    What kind of mathematical foundations are required for conducting research across the vast specialised branches of AI/ML/DL? [D]
    The absolute basic mathematics that is required to understand basic ML/DL are calculus, linear algebra, probability and some convex optimisation. We are all aware of that. But ML and DL has become a vast field both in breadth and depth. A single person can't understand the field entirely. There are specialistions and sub-specialisations and further more. If you work in a branch of ML/DL research where some other math fundamentals are needed to understand research papers and do innovative research, can you mention your field of work and the math fundamentals that are required to gain entry into your field? submitted by /u/HopeIsGold [link] [comments]
    [D] ROCm Support for AMD FirePro W4190M – Which Versions?
    Hello everyone, I’m looking to find out if my AMD FirePro W4190M supports ROCm, and if so, which version. The official ROCm documentation doesn't list this card, and I'm having trouble finding historical support information. Here are the card details for reference: AMD FirePro W4190M Official Page Has anyone here successfully run ROCm on a FirePro W4190M, or does anyone know which versions of ROCm, if any, supported this GPU? Any help or pointers to where I might find this information would be greatly appreciated! Thank you! submitted by /u/tunerhd [link] [comments]
    [P] Text classification methods?
    What is the best methods to detect harmful content such as racial abuse in tweets? I'm thinking about a research project in which I try various methods and compare their accuracy. Am I right in thinking that Naive Bayes, Logistic Regression, Support Vector Machine, LSTM and BERT would be some of the best methods? submitted by /u/madzakka [link] [comments]
    [D] Acquiring GPU resources for my university
    Hello everyone! I'm looking for guidance on how to secure GPU resources for courses and research projects at the National Technical University of Athens, Greece. We're in need of these resources to support our staff, especially for work in AI. Does anyone know of any programs, grants, or partnerships, particularly within the EU, that could assist academic institutions with this? Any pointers or experiences shared would be greatly appreciated! submitted by /u/DmKa01 [link] [comments]
    [R] Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation
    Paper: https://arxiv.org/abs/2310.02304 Abstract: Several recent advances in AI systems (e.g., Tree-of-Thoughts and Program-Aided Language Models) solve problems by providing a "scaffolding" program that structures multiple calls to language models to generate better outputs. A scaffolding program is written in a programming language such as Python. In this work, we use a language-model-infused scaffolding program to improve itself. We start with a seed "improver" that improves an input program according to a given utility function by querying a language model several times and returning the best solution. We then run this seed improver to improve itself. Across a small set of downstream tasks, the resulting improved improver generates programs with significantly better performance than its seed improver. Afterward, we analyze the variety of self-improvement strategies proposed by the language model, including beam search, genetic algorithms, and simulated annealing. Since the language models themselves are not altered, this is not full recursive self-improvement. Nonetheless, it demonstrates that a modern language model, GPT-4 in our proof-of-concept experiments, is capable of writing code that can call itself to improve itself. We critically consider concerns around the development of self-improving technologies and evaluate the frequency with which the generated code bypasses a sandbox. ​ https://preview.redd.it/1ibob0jc32zb1.png?width=1018&format=png&auto=webp&s=c3f8f729564cf2205458d4c912a796f2ec291bb2 https://preview.redd.it/55bqc3jc32zb1.png?width=1131&format=png&auto=webp&s=74e1bfc46bc6c9603dd9333bc95e4867d2ee6a83 ​ submitted by /u/APaperADay [link] [comments]
    [Research] Help me gather data for Dementia detection using ML
    I am part of a group working on developing a model to detect signs of dementia in the speech of elderly people, hoping to provide them and their family with awareness of their condition and also further down the road to develop more models that can understand their speech patterns. We are modeling with a database of voices of elderly people with dementia and "mild cognitive impairment". However, we need voice samples of people who are "normal" (do not have any dementia). Please use this Google form I created to help us gather this data to test our models! It will take less than 5 minutes. It uses a Google Chrome extension called Mote and can only be completed on a desktop with Chrome as the browser. There are instructions on the form how to install the Mote extension, as well as the task for the voice recording, which is to describe a picture. Thank you so much for anyone who can help us gather this data, as it will be very helpful in our mission to use technology to make life easier for the elderly. submitted by /u/andrewmalanowicz [link] [comments]
    [D] People who work on computer vision models on the edge, what devices do you deploy to?
    If you or anyone you know works on computer vision models deployed on the edge, would love to understand what type of hardware do you deploy to. Trying to understand the various options that exist when it comes to deploying computer vision models on devices. Some boards that I am aware of are: NVIIDA Jetson series Qualcomm 605 SOC Raspberry Pi BeagleBoard Arducam Pico4ML But wondering what is the industry standard for applications such as manufacturing robots, drones, autonomous robots that use lidar submitted by /u/acertainmoment [link] [comments]  ( 9 min )
  • Open

    USPS tracking numbers
    I noticed the other day that an app on my phone assumed that a long number was a USPS tracking number. I wondered how it decided that and did a little research. I assumed there was some structure to the number, at least a check sum if not more than that. This turned out to […] USPS tracking numbers first appeared on John D. Cook.  ( 5 min )
    Zero-Concentrated Differential Privacy
    Differential privacy can be rigid and overly conservative in practice, and so finding ways to relax pure differential privacy while retaining its benefits is an active area of research. Two approaches to doing this are concentrated differential privacy [1] and Rényi differential privacy [3]. Differential privacy quantifies the potential impact of an individual’s participation or […] Zero-Concentrated Differential Privacy first appeared on John D. Cook.  ( 5 min )
    Differentially private stochastic gradient descent
    Let’s work our way up to differentially private stochastic gradient descent (DP-SGD) a little at a time. We’ll first look at gradient descent, then stochastic gradient descent, then finally differentially private stochastic gradient descent. Gradient descent We’ll start with gradient descent. Suppose you have a function of several variables f(x) where x is a vector. […] Differentially private stochastic gradient descent first appeared on John D. Cook.  ( 6 min )
    Using dimensional analysis to check probability calculations
    Probability density functions are independent of physical units. The normal distribution, for example, works just as well when describing weights or times. But sticking in units anyway is useful. Normal distribution example Suppose you’re trying to remember the probability density function for the normal distribution. Is the correct form or or or maybe some other […] Using dimensional analysis to check probability calculations first appeared on John D. Cook.  ( 6 min )
    Randomized response and local differential privacy
    Differential privacy protects user privacy by adding randomness as necessary to the results of queries to a database containing private data. Local differential privacy protects user privacy by adding randomness before the data is inserted to the database. Using the visualization from this post, differential privacy takes the left and bottom (blue) path through the […] Randomized response and local differential privacy first appeared on John D. Cook.  ( 6 min )
    PATE framework for differentially private machine learning
    Machine learning models can memorize fragments of their training data and return these fragments verbatim. I’ve seen instances, for example, where I believe an LLM returned phrases verbatim from this site. It’s easy to imagine how medical data might leak this way. How might you prevent this? And how might you do it in a […] PATE framework for differentially private machine learning first appeared on John D. Cook.  ( 6 min )
  • Open

    Build a medical imaging AI inference pipeline with MONAI Deploy on AWS
    In this post, we show you how to create a MAP connector to AWS HealthImaging, which is reusable in applications built with the MONAI Deploy App SDK, to integrate with and accelerate image data retrieval from a cloud-native DICOM store to medical imaging AI workloads. The MONAI Deploy SDK can be used to support hospital operations. We also demonstrate two hosting options to deploy MAP AI applications on SageMaker at scale.  ( 10 min )
    Optimize for sustainability with Amazon CodeWhisperer
    This post explores how Amazon CodeWhisperer can help with code optimization for sustainability through increased resource efficiency. Computationally resource-efficient coding is one technique that aims to reduce the amount of energy required to process a line of code and, as a result, aid companies in consuming less energy overall. In this era of cloud computing, […]  ( 8 min )
  • Open

    Neural Networks Project
    Hello guys, I’m enrolled in a course at the university and I’m requested to submit a project at the end of the semester. I need some ideas that could be applicable and interesting in neural networks. I’ve proposed two ideas and the instructor refused them. Your help is much appreciated!! submitted by /u/Hussein_Jammal [link] [comments]
    Struggling with understanding the concept of bias
    Hi there, hope you are well! I understand the concepts of input, output and weight - but I struggle to understand the purpose of adding a constant or bias. Why do we do this? Thank you! submitted by /u/Loose-Tea-7478 [link] [comments]
  • Open

    Acing the Test: NVIDIA Turbocharges Generative AI Training in MLPerf Benchmarks
    NVIDIA’s AI platform raised the bar for AI training and high performance computing in the latest MLPerf industry benchmarks. Among many new records and milestones, one in generative AI stands out: NVIDIA Eos — an AI supercomputer powered by a whopping 10,752 NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking — completed a Read article >  ( 7 min )
    Acing the Test: NVIDIA Turbocharges Generative AI Training in MLPerf Benchmarks
    NVIDIA’s AI platform raised the bar for AI training and high performance computing in the latest MLPerf industry benchmarks. Among many new records and milestones, one in generative AI stands out: NVIDIA Eos — an AI supercomputer powered by a whopping 10,752 NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking — completed a Read article >  ( 7 min )
    NVIDIA Partners With APEC Economies to Change Lives, Increase Opportunity, Improve Outcomes
    When patients in Vietnam enter a medical facility in distress, doctors use NVIDIA technology to get more accurate scans to diagnose their ailments. In Hong Kong, a different set of doctors leverage generative AI to discover new cures for patients. Improving the health and well-being of citizens and strengthening economies and communities are key themes Read article >  ( 6 min )
    NVIDIA Partners With APEC Economies to Change Lives, Increase Opportunity, Improve Outcomes
    When patients in Vietnam enter a medical facility in distress, doctors use NVIDIA technology to get more accurate scans to diagnose their ailments. In Hong Kong, a different set of doctors leverage generative AI to discover new cures for patients. Improving the health and well-being of citizens and strengthening economies and communities are key themes Read article >  ( 6 min )
    Harrison.ai CEO Dr. Aengus Tran on Using AI as a Spell Check for Health Checks
    Clinician-led healthcare AI company Harrison.ai has built an AI system that effectively serves as a “spell checker” for radiologists — flagging critical findings to improve the speed and accuracy of radiology image analysis, reducing misdiagnoses. In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Harrison.ai cofounder and CEO Aengus Tran about Read article >  ( 6 min )
    Harrison.ai CEO Dr. Aengus Tran on Using AI as a Spell Check for Health Checks
    Clinician-led healthcare AI company Harrison.ai has built an AI system that effectively serves as a “spell checker” for radiologists — flagging critical findings to improve the speed and accuracy of radiology image analysis, reducing misdiagnoses. In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Harrison.ai cofounder and CEO Aengus Tran about Read article >  ( 6 min )
  • Open

    Research Focus: Week of November 8, 2023
    Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Generating both plausible and accurate full body avatar motion is essential for creating high quality immersive experiences in mixed reality scenarios. Head-mounted devices (HMDs) typically only provide a […] The post Research Focus: Week of November 8, 2023 appeared first on Microsoft Research.  ( 9 min )
  • Open

    PlaNeRF: SVD Unsupervised 3D Plane Regularization for NeRF Large-Scale Scene Reconstruction. (arXiv:2305.16914v4 [cs.CV] UPDATED)
    Neural Radiance Fields (NeRF) enable 3D scene reconstruction from 2D images and camera poses for Novel View Synthesis (NVS). Although NeRF can produce photorealistic results, it often suffers from overfitting to training views, leading to poor geometry reconstruction, especially in low-texture areas. This limitation restricts many important applications which require accurate geometry, such as extrapolated NVS, HD mapping and scene editing. To address this limitation, we propose a new method to improve NeRF's 3D structure using only RGB images and semantic maps. Our approach introduces a novel plane regularization based on Singular Value Decomposition (SVD), that does not rely on any geometric prior. In addition, we leverage the Structural Similarity Index Measure (SSIM) in our loss design to properly initialize the volumetric representation of NeRF. Quantitative and qualitative results show that our method outperforms popular regularization approaches in accurate geometry reconstruction for large-scale outdoor scenes and achieves SoTA rendering quality on the KITTI-360 NVS benchmark.  ( 2 min )
    DreamWaltz: Make a Scene with Complex 3D Animatable Avatars. (arXiv:2305.12529v3 [cs.CV] UPDATED)
    We present DreamWaltz, a novel framework for generating and animating complex 3D avatars given text guidance and parametric human body prior. While recent methods have shown encouraging results for text-to-3D generation of common objects, creating high-quality and animatable 3D avatars remains challenging. To create high-quality 3D avatars, DreamWaltz proposes 3D-consistent occlusion-aware Score Distillation Sampling (SDS) to optimize implicit neural representations with canonical poses. It provides view-aligned supervision via 3D-aware skeleton conditioning which enables complex avatar generation without artifacts and multiple faces. For animation, our method learns an animatable 3D avatar representation from abundant image priors of diffusion model conditioned on various poses, which could animate complex non-rigged avatars given arbitrary poses without retraining. Extensive evaluations demonstrate that DreamWaltz is an effective and robust approach for creating 3D avatars that can take on complex shapes and appearances as well as novel poses for animation. The proposed framework further enables the creation of complex scenes with diverse compositions, including avatar-avatar, avatar-object and avatar-scene interactions. See https://dreamwaltz3d.github.io/ for more vivid 3D avatar and animation results.  ( 2 min )
    MARLlib: A Scalable and Efficient Multi-agent Reinforcement Learning Library. (arXiv:2210.13708v4 [cs.LG] UPDATED)
    A significant challenge facing researchers in the area of multi-agent reinforcement learning (MARL) pertains to the identification of a library that can offer fast and compatible development for multi-agent tasks and algorithm combinations, while obviating the need to consider compatibility issues. In this paper, we present MARLlib, a library designed to address the aforementioned challenge by leveraging three key mechanisms: 1) a standardized multi-agent environment wrapper, 2) an agent-level algorithm implementation, and 3) a flexible policy mapping strategy. By utilizing these mechanisms, MARLlib can effectively disentangle the intertwined nature of the multi-agent task and the learning process of the algorithm, with the ability to automatically alter the training strategy based on the current task's attributes. The MARLlib library's source code is publicly accessible on GitHub: \url{https://github.com/Replicable-MARL/MARLlib}.  ( 2 min )
    Expanding continual few-shot learning benchmarks to include recognition of specific instances. (arXiv:2209.07863v3 [cs.NE] UPDATED)
    Continual learning and few-shot learning are important frontiers in progress towards broader Machine Learning (ML) capabilities. There is a growing body of work in both, but few works combining the two. One exception is the Continual few-shot Learning (CFSL) framework of Antoniou et al. arXiv:2004.11967. In this study, we extend CFSL in two ways that capture a broader range of challenges, important for intelligent agent behaviour in real-world conditions. First, we modify CFSL to make it more comparable to standard continual learning experiments, where usually a much larger number of classes are presented. Second, we introduce an 'instance test' which requires recognition of specific instances of classes -- a capability of animal cognition that is usually neglected in ML. For an initial exploration of ML model performance under these conditions, we selected representative baseline models from the original CFSL work and added a model variant with replay. As expected, learning more classes is more difficult than the original CFSL experiments, and interestingly, the way in which image instances and classes are presented affects classification performance. Surprisingly, accuracy in the baseline instance test is comparable to other classification tasks, but poor given significant occlusion and noise. The use of replay for consolidation improves performance substantially for both types of tasks, but particularly the instance test.  ( 3 min )
    PhysGraph: Physics-Based Integration Using Graph Neural Networks. (arXiv:2301.11841v2 [cs.GR] UPDATED)
    Physics-based simulation of mesh based domains remains a challenging task. State-of-the-art techniques can produce realistic results but require expert knowledge. A major bottleneck in many approaches is the step of integrating a potential energy in order to compute velocities or displacements. Recently, learning based method for physics-based simulation have sparked interest with graph based approaches being a promising research direction. One of the challenges for these methods is to generate models that are mesh independent and generalize to different material properties. Moreover, the model should also be able to react to unforeseen external forces like ubiquitous collisions. Our contribution is based on a simple observation: evaluating forces is computationally relatively cheap for traditional simulation methods and can be computed in parallel in contrast to their integration. If we learn how a system reacts to forces in general, irrespective of their origin, we can learn an integrator that can predict state changes due to the total forces with high generalization power. We effectively factor out the physical model behind resulting forces by relying on an opaque force module. We demonstrate that this idea leads to a learnable module that can be trained on basic internal forces of small mesh patches and generalizes to different mesh typologies, resolutions, material parameters and unseen forces like collisions at inference time. Our proposed paradigm is general and can be used to model a variety of physical phenomena. We focus our exposition on the detail enhancement of coarse clothing geometry which has many applications including computer games, virtual reality and virtual try-on.  ( 3 min )
    Model-free optimization of power/efficiency tradeoffs in quantum thermal machines using reinforcement learning. (arXiv:2204.04785v2 [quant-ph] UPDATED)
    A quantum thermal machine is an open quantum system that enables the conversion between heat and work at the micro or nano-scale. Optimally controlling such out-of-equilibrium systems is a crucial yet challenging task with applications to quantum technologies and devices. We introduce a general model-free framework based on Reinforcement Learning to identify out-of-equilibrium thermodynamic cycles that are Pareto optimal trade-offs between power and efficiency for quantum heat engines and refrigerators. The method does not require any knowledge of the quantum thermal machine, nor of the system model, nor of the quantum state. Instead, it only observes the heat fluxes, so it is both applicable to simulations and experimental devices. We test our method on a model of an experimentally realistic refrigerator based on a superconducting qubit, and on a heat engine based on a quantum harmonic oscillator. In both cases, we identify the Pareto-front representing optimal power-efficiency tradeoffs, and the corresponding cycles. Such solutions outperform previous proposals made in the literature, such as optimized Otto cycles, reducing quantum friction.  ( 2 min )
    Efficient First-order Methods for Convex Optimization with Strongly Convex Function Constraints. (arXiv:2212.11143v3 [math.OC] UPDATED)
    In this paper, we introduce faster first-order primal-dual algorithms for minimizing a convex function subject to strongly convex function constraints. Before our work, the best complexity bound was $\mathcal{O}(1/{\varepsilon})$, and it remains unclear how to improve this result by leveraging the strong convexity assumption. We address this issue by developing novel techniques to progressively estimate the strong convexity of the Lagrangian function. Our approach yields an improved complexity of $\mathcal{O}(1/\sqrt{\varepsilon})$, matching the complexity lower bound for strongly-convex-concave saddle point optimization. We show the superior performance of our methods in sparsity-inducing constrained optimization, notably Google's personalized PageRank problem. Furthermore, we show that a restarted version of the proposed methods can effectively identify the sparsity pattern of the optimal solution within a finite number of steps, a result that appears to have independent significance.  ( 2 min )
    Flat Seeking Bayesian Neural Networks. (arXiv:2302.02713v5 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) provide a probabilistic interpretation for deep learning models by imposing a prior distribution over model parameters and inferring a posterior distribution based on observed data. The model sampled from the posterior distribution can be used for providing ensemble predictions and quantifying prediction uncertainty. It is well-known that deep learning models with lower sharpness have better generalization ability. However, existing posterior inferences are not aware of sharpness/flatness in terms of formulation, possibly leading to high sharpness for the models sampled from them. In this paper, we develop theories, the Bayesian setting, and the variational inference approach for the sharpness-aware posterior. Specifically, the models sampled from our sharpness-aware posterior, and the optimal approximate posterior estimating this sharpness-aware posterior, have better flatness, hence possibly possessing higher generalization ability. We conduct experiments by leveraging the sharpness-aware posterior with state-of-the-art Bayesian Neural Networks, showing that the flat-seeking counterparts outperform their baselines in all metrics of interest.  ( 2 min )
    Robust Meta-Representation Learning via Global Label Inference and Classification. (arXiv:2212.11702v2 [cs.LG] UPDATED)
    Few-shot learning (FSL) is a central problem in meta-learning, where learners must efficiently learn from few labeled examples. Within FSL, feature pre-training has recently become an increasingly popular strategy to significantly improve generalization performance. However, the contribution of pre-training is often overlooked and understudied, with limited theoretical understanding of its impact on meta-learning performance. Further, pre-training requires a consistent set of global labels shared across training tasks, which may be unavailable in practice. In this work, we address the above issues by first showing the connection between pre-training and meta-learning. We discuss why pre-training yields more robust meta-representation and connect the theoretical analysis to existing works and empirical results. Secondly, we introduce Meta Label Learning (MeLa), a novel meta-learning algorithm that learns task relations by inferring global labels across tasks. This allows us to exploit pre-training for FSL even when global labels are unavailable or ill-defined. Lastly, we introduce an augmented pre-training procedure that further improves the learned meta-representation. Empirically, MeLa outperforms existing methods across a diverse range of benchmarks, in particular under a more challenging setting where the number of training tasks is limited and labels are task-specific. We also provide extensive ablation study to highlight its key properties.  ( 2 min )
    ClaPIM: Scalable Sequence CLAssification using Processing-In-Memory. (arXiv:2302.08284v2 [cs.LG] UPDATED)
    DNA sequence classification is a fundamental task in computational biology with vast implications for applications such as disease prevention and drug design. Therefore, fast high-quality sequence classifiers are significantly important. This paper introduces ClaPIM, a scalable DNA sequence classification architecture based on the emerging concept of hybrid in-crossbar and near-crossbar memristive processing-in-memory (PIM). We enable efficient and high-quality classification by uniting the filter and search stages within a single algorithm. Specifically, we propose a custom filtering technique that drastically narrows the search space and a search approach that facilitates approximate string matching through a distance function. ClaPIM is the first PIM architecture for scalable approximate string matching that benefits from the high density of memristive crossbar arrays and the massive computational parallelism of PIM. Compared with Kraken2, a state-of-the-art software classifier, ClaPIM provides significantly higher classification quality (up to 20x improvement in F1 score) and also demonstrates a 1.8x throughput improvement. Compared with EDAM, a recently-proposed SRAM-based accelerator that is restricted to small datasets, we observe both a 30.4x improvement in normalized throughput per area and a 7% increase in classification precision.  ( 2 min )
    Federated Learning and Meta Learning: Approaches, Applications, and Directions. (arXiv:2210.13111v2 [cs.LG] UPDATED)
    Over the past few years, significant advancements have been made in the field of machine learning (ML) to address resource management, interference management, autonomy, and decision-making in wireless networks. Traditional ML approaches rely on centralized methods, where data is collected at a central server for training. However, this approach poses a challenge in terms of preserving the data privacy of devices. To address this issue, federated learning (FL) has emerged as an effective solution that allows edge devices to collaboratively train ML models without compromising data privacy. In FL, local datasets are not shared, and the focus is on learning a global model for a specific task involving all devices. However, FL has limitations when it comes to adapting the model to devices with different data distributions. In such cases, meta learning is considered, as it enables the adaptation of learning models to different data distributions using only a few data samples. In this tutorial, we present a comprehensive review of FL, meta learning, and federated meta learning (FedMeta). Unlike other tutorial papers, our objective is to explore how FL, meta learning, and FedMeta methodologies can be designed, optimized, and evolved, and their applications over wireless networks. We also analyze the relationships among these learning algorithms and examine their advantages and disadvantages in real-world applications.  ( 3 min )
    Evaluating Language Models for Mathematics through Interactions. (arXiv:2306.01694v2 [cs.LG] UPDATED)
    There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs, and is insufficient for making an informed decision about which LLMs and under which assistive settings can they be sensibly used. Static assessment fails to account for the essential interactive element in LLM deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty respond well to user corrections, and are more interpretable and concise may constitute better assistants. Interactive evaluation is a promising way to navigate the capability of these models; humans should be aware of language models' algebraic fallibility and discern where they are appropriate to use.
    Rethinking the Power of Graph Canonization in Graph Representation Learning with Stability. (arXiv:2309.00738v2 [cs.LG] UPDATED)
    The expressivity of Graph Neural Networks (GNNs) has been studied broadly in recent years to reveal the design principles for more powerful GNNs. Graph canonization is known as a typical approach to distinguish non-isomorphic graphs, yet rarely adopted when developing expressive GNNs. This paper proposes to maximize the expressivity of GNNs by graph canonization, then the power of such GNNs is studies from the perspective of model stability. A stable GNN will map similar graphs to close graph representations in the vectorial space, and the stability of GNNs is critical to generalize their performance to unseen graphs. We theoretically reveal the trade-off of expressivity and stability in graph-canonization-enhanced GNNs. Then we introduce a notion of universal graph canonization as the general solution to address the trade-off and characterize a widely applicable sufficient condition to solve the universal graph canonization. A comprehensive set of experiments demonstrates the effectiveness of the proposed method. In many popular graph benchmark datasets, graph canonization successfully enhances GNNs and provides highly competitive performance, indicating the capability and great potential of proposed method in general graph representation learning. In graph datasets where the sufficient condition holds, GNNs enhanced by universal graph canonization consistently outperform GNN baselines and successfully improve the SOTA performance up to $31\%$, providing the optimal solution to numerous challenging real-world graph analytical tasks like gene network representation learning in bioinformatics.
    VQ-NeRF: Neural Reflectance Decomposition and Editing with Vector Quantization. (arXiv:2310.11864v2 [cs.CV] UPDATED)
    We propose VQ-NeRF, a two-branch neural network model that incorporates Vector Quantization (VQ) to decompose and edit reflectance fields in 3D scenes. Conventional neural reflectance fields use only continuous representations to model 3D scenes, despite the fact that objects are typically composed of discrete materials in reality. This lack of discretization can result in noisy material decomposition and complicated material editing. To address these limitations, our model consists of a continuous branch and a discrete branch. The continuous branch follows the conventional pipeline to predict decomposed materials, while the discrete branch uses the VQ mechanism to quantize continuous materials into individual ones. By discretizing the materials, our model can reduce noise in the decomposition process and generate a segmentation map of discrete materials. Specific materials can be easily selected for further editing by clicking on the corresponding area of the segmentation outcomes. Additionally, we propose a dropout-based VQ codeword ranking strategy to predict the number of materials in a scene, which reduces redundancy in the material segmentation process. To improve usability, we also develop an interactive interface to further assist material editing. We evaluate our model on both computer-generated and real-world scenes, demonstrating its superior performance. To the best of our knowledge, our model is the first to enable discrete material editing in 3D scenes.
    Equal Opportunity of Coverage in Fair Regression. (arXiv:2311.02243v1 [cs.LG])
    We study fair machine learning (ML) under predictive uncertainty to enable reliable and trustworthy decision-making. The seminal work of ``equalized coverage'' proposed an uncertainty-aware fairness notion. However, it does not guarantee equal coverage rates across more fine-grained groups (e.g., low-income females) conditioning on the true label and is biased in the assessment of uncertainty. To tackle these limitations, we propose a new uncertainty-aware fairness -- Equal Opportunity of Coverage (EOC) -- that aims to achieve two properties: (1) coverage rates for different groups with similar outcomes are close, and (2) the coverage rate for the entire population remains at a predetermined level. Further, the prediction intervals should be narrow to be informative. We propose Binned Fair Quantile Regression (BFQR), a distribution-free post-processing method to improve EOC with reasonable width for any trained ML models. It first calibrates a hold-out set to bound deviation from EOC, then leverages conformal prediction to maintain EOC on a test set, meanwhile optimizing prediction interval width. Experimental results demonstrate the effectiveness of our method in improving EOC. Our code is publicly available at https://github.com/fangxin-wang/bfqr .
    Differentiable Clustering with Perturbed Spanning Forests. (arXiv:2305.16358v3 [cs.LG] UPDATED)
    We introduce a differentiable clustering method based on stochastic perturbations of minimum-weight spanning forests. This allows us to include clustering in end-to-end trainable pipelines, with efficient gradients. We show that our method performs well even in difficult settings, such as data sets with high noise and challenging geometries. We also formulate an ad hoc loss to efficiently learn from partial clustering data using this operation. We demonstrate its performance on several data sets for supervised and semi-supervised tasks.
    Practical Equivariances via Relational Conditional Neural Processes. (arXiv:2306.10915v2 [stat.ML] UPDATED)
    Conditional Neural Processes (CNPs) are a class of metalearning models popular for combining the runtime efficiency of amortized inference with reliable uncertainty quantification. Many relevant machine learning tasks, such as in spatio-temporal modeling, Bayesian Optimization and continuous control, inherently contain equivariances -- for example to translation -- which the model can exploit for maximal performance. However, prior attempts to include equivariances in CNPs do not scale effectively beyond two input dimensions. In this work, we propose Relational Conditional Neural Processes (RCNPs), an effective approach to incorporate equivariances into any neural process model. Our proposed method extends the applicability and impact of equivariant neural processes to higher dimensions. We empirically demonstrate the competitive performance of RCNPs on a large array of tasks naturally containing equivariances.
    Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features. (arXiv:2308.06197v2 [cs.CV] UPDATED)
    Complex emotion recognition is a cognitive task that has so far eluded the same excellent performance of other tasks that are at or above the level of human cognition. Emotion recognition through facial expressions is particularly difficult due to the complexity of emotions expressed by the human face. For a machine to approach the same level of performance in complex facial expression recognition as a human, it may need to synthesise knowledge and understand new concepts in real-time, as humans do. Humans are able to learn new concepts using only few examples by distilling important information from memories. Inspired by human cognition and learning, we propose a novel continual learning method for complex facial expression recognition that can accurately recognise new compound expression classes using few training samples, by building on and retaining its knowledge of basic expression classes. In this work, we also use GradCAM visualisations to demonstrate the relationship between basic and compound facial expressions. Our method leverages this relationship through knowledge distillation and a novel Predictive Sorting Memory Replay, to achieve the current state-of-the-art in continual learning for complex facial expression recognition, with 74.28% Overall Accuracy on new classes. We also demonstrate that using continual learning for complex facial expression recognition achieves far better performance than non-continual learning methods, improving on state-of-the-art non-continual learning methods by 13.95%. Our work is also the first to apply few-shot learning to complex facial expression recognition, achieving the state-of-the-art with 100% accuracy using only a single training sample per class.
    PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference. (arXiv:2309.02334v2 [cs.LG] UPDATED)
    Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by the idea that the LUTs in an FPGA can be used to implement a much greater variety of functions than this. In this paper, we propose a novel approach to training neural networks for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. We show that by using polynomial building blocks, we can achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.
    Self-Supervised Learning of Representations for Space Generates Multi-Modular Grid Cells. (arXiv:2311.02316v1 [cs.LG])
    To solve the spatial problems of mapping, localization and navigation, the mammalian lineage has developed striking spatial representations. One important spatial representation is the Nobel-prize winning grid cells: neurons that represent self-location, a local and aperiodic quantity, with seemingly bizarre non-local and spatially periodic activity patterns of a few discrete periods. Why has the mammalian lineage learnt this peculiar grid representation? Mathematical analysis suggests that this multi-periodic representation has excellent properties as an algebraic code with high capacity and intrinsic error-correction, but to date, there is no satisfactory synthesis of core principles that lead to multi-modular grid cells in deep recurrent neural networks. In this work, we begin by identifying key insights from four families of approaches to answering the grid cell question: coding theory, dynamical systems, function optimization and supervised deep learning. We then leverage our insights to propose a new approach that combines the strengths of all four approaches. Our approach is a self-supervised learning (SSL) framework - including data, data augmentations, loss functions and a network architecture - motivated from a normative perspective, without access to supervised position information or engineering of particular readout representations as needed in previous approaches. We show that multiple grid cell modules can emerge in networks trained on our SSL framework and that the networks and emergent representations generalize well outside their training distribution. This work contains insights for neuroscientists interested in the origins of grid cells as well as machine learning researchers interested in novel SSL frameworks.
    The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing. (arXiv:2302.01186v3 [cs.LG] UPDATED)
    We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparametrized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overprameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.
    FedZero: Leveraging Renewable Excess Energy in Federated Learning. (arXiv:2305.15092v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an emerging machine learning technique that enables distributed model training across data silos or edge devices without data sharing. Yet, FL inevitably introduces inefficiencies compared to centralized model training, which will further increase the already high energy usage and associated carbon emissions of machine learning in the future. One idea to reduce FL's carbon footprint is to schedule training jobs based on the availability of renewable excess energy that can occur at certain times and places in the grid. However, in the presence of such volatile and unreliable resources, existing FL schedulers cannot always ensure fast, efficient, and fair trainings. We propose FedZero, an FL system that operates exclusively on renewable excess energy and spare capacity of compute infrastructure to effectively reduce a training's operational carbon emissions to zero. Using energy and load forecasts, FedZero leverages the spatio-temporal availability of excess resources by selecting clients for fast convergence and fair participation. Our evaluation, based on real solar and load traces, shows that FedZero converges significantly faster than existing approaches under the mentioned constraints while consuming less energy. Furthermore, it is robust to forecasting errors and scalable to tens of thousands of clients.
    Sparse Training of Discrete Diffusion Models for Graph Generation. (arXiv:2311.02142v1 [cs.LG])
    Generative models for graphs often encounter scalability challenges due to the inherent need to predict interactions for every node pair. Despite the sparsity often exhibited by real-world graphs, the unpredictable sparsity patterns of their adjacency matrices, stemming from their unordered nature, leads to quadratic computational complexity. In this work, we introduce SparseDiff, a denoising diffusion model for graph generation that is able to exploit sparsity during its training phase. At the core of SparseDiff is a message-passing neural network tailored to predict only a subset of edges during each forward pass. When combined with a sparsity-preserving noise model, this model can efficiently work with edge lists representations of graphs, paving the way for scalability to much larger structures. During the sampling phase, SparseDiff iteratively populates the adjacency matrix from its prior state, ensuring prediction of the full graph while controlling memory utilization. Experimental results show that SparseDiff simultaneously matches state-of-the-art in generation performance on both small and large graphs, highlighting the versatility of our method.
    Multi-task Learning for Optical Coherence Tomography Angiography (OCTA) Vessel Segmentation. (arXiv:2311.02266v1 [eess.IV])
    Optical Coherence Tomography Angiography (OCTA) is a non-invasive imaging technique that provides high-resolution cross-sectional images of the retina, which are useful for diagnosing and monitoring various retinal diseases. However, manual segmentation of OCTA images is a time-consuming and labor-intensive task, which motivates the development of automated segmentation methods. In this paper, we propose a novel multi-task learning method for OCTA segmentation, called OCTA-MTL, that leverages an image-to-DT (Distance Transform) branch and an adaptive loss combination strategy. The image-to-DT branch predicts the distance from each vessel voxel to the vessel surface, which can provide useful shape prior and boundary information for the segmentation task. The adaptive loss combination strategy dynamically adjusts the loss weights according to the inverse of the average loss values of each task, to balance the learning process and avoid the dominance of one task over the other. We evaluate our method on the ROSE-2 dataset its superiority in terms of segmentation performance against two baseline methods: a single-task segmentation method and a multi-task segmentation method with a fixed loss combination.
    Gacs-Korner Common Information Variational Autoencoder. (arXiv:2205.12239v2 [cs.LG] UPDATED)
    We propose a notion of common information that allows one to quantify and separate the information that is shared between two random variables from the information that is unique to each. Our notion of common information is defined by an optimization problem over a family of functions and recovers the G\'acs-K\"orner common information as a special case. Importantly, our notion can be approximated empirically using samples from the underlying data distribution. We then provide a method to partition and quantify the common and unique information using a simple modification of a traditional variational auto-encoder. Empirically, we demonstrate that our formulation allows us to learn semantically meaningful common and unique factors of variation even on high-dimensional data such as images and videos. Moreover, on datasets where ground-truth latent factors are known, we show that we can accurately quantify the common information between the random variables.
    A Theory of Unsupervised Translation Motivated by Understanding Animal Communication. (arXiv:2211.11081v2 [cs.CL] UPDATED)
    Neural networks are capable of translating between languages -- in some cases even between two languages where there is little or no access to parallel translations, in what is known as Unsupervised Machine Translation (UMT). Given this progress, it is intriguing to ask whether machine learning tools can ultimately enable understanding animal communication, particularly that of highly intelligent animals. We propose a theoretical framework for analyzing UMT when no parallel translations are available and when it cannot be assumed that the source and target corpora address related subject domains or posses similar linguistic structure. We exemplify this theory with two stylized models of language, for which our framework provides bounds on necessary sample complexity; the bounds are formally proven and experimentally verified on synthetic data. These bounds show that the error rates are inversely related to the language complexity and amount of common ground. This suggests that unsupervised translation of animal communication may be feasible if the communication system is sufficiently complex.
    Unified Out-Of-Distribution Detection: A Model-Specific Perspective. (arXiv:2304.06813v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection aims to identify test examples that do not belong to the training distribution and are thus unlikely to be predicted reliably. Despite a plethora of existing works, most of them focused only on the scenario where OOD examples come from semantic shift (e.g., unseen categories), ignoring other possible causes (e.g., covariate shift). In this paper, we present a novel, unifying framework to study OOD detection in a broader scope. Instead of detecting OOD examples from a particular cause, we propose to detect examples that a deployed machine learning model (e.g., an image classifier) is unable to predict correctly. That is, whether a test example should be detected and rejected or not is ``model-specific''. We show that this framework unifies the detection of OOD examples caused by semantic shift and covariate shift, and closely addresses the concern of applying a machine learning model to uncontrolled environments. We provide an extensive analysis that involves a variety of models (e.g., different architectures and training strategies), sources of OOD examples, and OOD detection approaches, and reveal several insights into improving and understanding OOD detection in uncontrolled environments.
    Harmonic Self-Conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design. (arXiv:2310.05764v2 [cs.LG] UPDATED)
    A significant amount of protein function requires binding small molecules, including enzymatic catalysis. As such, designing binding pockets for small molecules has several impactful applications ranging from drug synthesis to energy storage. Towards this goal, we first develop HarmonicFlow, an improved generative process over 3D protein-ligand binding structures based on our self-conditioned flow matching objective. FlowSite extends this flow model to jointly generate a protein pocket's discrete residue types and the molecule's binding 3D structure. We show that HarmonicFlow improves upon state-of-the-art generative processes for docking in simplicity, generality, and average sample quality in pocket-level docking. Enabled by this structure modeling, FlowSite designs binding sites substantially better than baseline approaches.
    Monte Carlo is a good sampling strategy for polynomial approximation in high dimensions. (arXiv:2208.09045v3 [math.NA] UPDATED)
    This paper concerns the approximation of smooth, high-dimensional functions from limited samples using polynomials. This task lies at the heart of many applications in computational science and engineering - notably, some of those arising from parametric modelling and computational uncertainty quantification. It is common to use Monte Carlo sampling in such applications, so as not to succumb to the curse of dimensionality. However, it is well known that such a strategy is theoretically suboptimal. Specifically, there are many polynomial spaces of dimension $n$ for which the sample complexity scales log-quadratically, i.e., like $c \cdot n^2 \cdot \log(n)$ as $n \rightarrow \infty$. This well-documented phenomenon has led to a concerted effort over the last decade to design improved, and moreover, near-optimal strategies, whose sample complexities scale log-linearly, or even linearly in $n$. In this work we demonstrate that Monte Carlo is actually a perfectly good strategy in high dimensions, despite its apparent suboptimality. We first document this phenomenon empirically via a systematic set of numerical experiments. Next, we present a theoretical analysis that rigorously justifies this fact in the case of holomorphic functions of infinitely-many variables. We show that there is a least-squares approximation based on $m$ Monte Carlo samples whose error decays algebraically fast in $m/\log(m)$, with a rate that is the same as that of the best $n$-term polynomial approximation. This result is non-constructive, since it assumes knowledge of a suitable polynomial subspace in which to perform the approximation. We next present a compressed sensing-based scheme that achieves the same rate, except for a larger polylogarithmic factor. This scheme is practical, and numerically it performs as well as or better than well-known adaptive least-squares schemes.
    Learning a Consensus Sub-Network with Polarization Regularization and One Pass Training. (arXiv:2302.10798v4 [cs.LG] UPDATED)
    The subject of green AI has been gaining attention within the deep learning community given the recent trend of ever larger and more complex neural network models. Existing solutions for reducing the computational load of training at inference time usually involve pruning the network parameters. Pruning schemes often create extra overhead either by iterative training and fine-tuning for static pruning or repeated computation of a dynamic pruning graph. We propose a new parameter pruning strategy for learning a lighter-weight sub-network that minimizes the energy cost while maintaining comparable performance to the fully parameterised network on given downstream tasks. Our proposed pruning scheme is green-oriented, as it only requires a one-off training to discover the optimal static sub-networks by dynamic pruning methods. The pruning scheme consists of a binary gating module and a novel loss function to uncover sub-networks with user-defined sparsity. Our method enables pruning and training simultaneously, which saves energy in both the training and inference phases and avoids extra computational overhead from gating modules at inference time. Our results on CIFAR-10 and CIFAR-100 suggest that our scheme can remove 50% of connections in deep networks with less than 1% reduction in classification accuracy. Compared to other related pruning methods, our method demonstrates a lower drop in accuracy for equivalent reductions in computational cost.
    Interpretability is not Explainability: New Quantitative XAI Approach with a focus on Recommender Systems in Education. (arXiv:2311.02078v1 [cs.IR])
    The field of eXplainable Artificial Intelligence faces challenges due to the absence of a widely accepted taxonomy that facilitates the quantitative evaluation of explainability in Machine Learning algorithms. In this paper, we propose a novel taxonomy that addresses the current gap in the literature by providing a clear and unambiguous understanding of the key concepts and relationships in XAI. Our approach is rooted in a systematic analysis of existing definitions and frameworks, with a focus on transparency, interpretability, completeness, complexity and understandability as essential dimensions of explainability. This comprehensive taxonomy aims to establish a shared vocabulary for future research. To demonstrate the utility of our proposed taxonomy, we examine a case study of a Recommender System designed to curate and recommend the most suitable online resources from MERLOT. By employing the SHAP package, we quantify and enhance the explainability of the RS within the context of our newly developed taxonomy.
    Performance evaluation of deep segmentation models for Contrails detection. (arXiv:2211.14851v4 [cs.CV] UPDATED)
    Contrails, short for condensation trails, are line-shaped ice clouds produced by aircraft engine exhaust when they fly through cold and humid air. They generate a greenhouse effect by absorbing or directing back to Earth approximately 33% of emitted outgoing longwave radiation. They account for over half of the climate change resulting from aviation activities. Avoiding contrails and adjusting flight routes could be an inexpensive and effective way to reduce their impact. An accurate, automated, and reliable detection algorithm is required to develop and evaluate contrail avoidance strategies. Advancement in contrail detection has been severely limited due to several factors, primarily due to a lack of quality-labeled data. Recently, proposed a large human-labeled Landsat-8 contrails dataset. Each contrail is carefully labeled with various inputs in various scenes of Landsat-8 satellite imagery. In this work, we benchmark several popular segmentation models with combinations of different loss functions and encoder backbones. This work is the first to apply state-of-the-art segmentation techniques to detect contrails in low-orbit satellite imagery. Our work can also be used as an open benchmark for contrail segmentation and is publicly available.
    Making Harmful Behaviors Unlearnable for Large Language Models. (arXiv:2311.02105v1 [cs.LG])
    Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains. To meet the requirements of different applications, LLMs are often customized by further fine-tuning. However, the powerful learning ability of LLMs not only enables them to acquire new tasks but also makes them susceptible to learning undesired behaviors. For example, even safety-aligned LLMs can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content. Can we train LLMs on harmful data without learning harmful behaviors? This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process. Specifically, we introduce ``security vectors'', a few new parameters that can be separated from the LLM, to ensure LLM's responses are consistent with the harmful behavior. Security vectors are activated during fine-tuning, the consistent behavior makes LLM believe that such behavior has already been learned, there is no need to further optimize for harmful data. During inference, we can deactivate security vectors to restore the LLM's normal behavior. The experimental results show that the security vectors generated by 100 harmful samples are enough to prevent LLM from learning 1000 harmful samples, while preserving the ability to learn other useful information.
    The Alignment Problem in Context. (arXiv:2311.02147v1 [cs.LG])
    A core challenge in the development of increasingly capable AI systems is to make them safe and reliable by ensuring their behaviour is consistent with human values. This challenge, known as the alignment problem, does not merely apply to hypothetical future AI systems that may pose catastrophic risks; it already applies to current systems, such as large language models, whose potential for harm is rapidly increasing. In this paper, I assess whether we are on track to solve the alignment problem for large language models, and what that means for the safety of future AI systems. I argue that existing strategies for alignment are insufficient, because large language models remain vulnerable to adversarial attacks that can reliably elicit unsafe behaviour. I offer an explanation of this lingering vulnerability on which it is not simply a contingent limitation of current language models, but has deep technical ties to a crucial aspect of what makes these models useful and versatile in the first place -- namely, their remarkable aptitude to learn "in context" directly from user instructions. It follows that the alignment problem is not only unsolved for current AI systems, but may be intrinsically difficult to solve without severely undermining their capabilities. Furthermore, this assessment raises concerns about the prospect of ensuring the safety of future and more capable AI systems.
    Distributed Machine Learning in D2D-Enabled Heterogeneous Networks: Architectures, Performance, and Open Challenges. (arXiv:2206.01906v2 [cs.LG] UPDATED)
    The ever-growing concerns regarding data privacy have led to a paradigm shift in machine learning (ML) architectures from centralized to distributed approaches, giving rise to federated learning (FL) and split learning (SL) as the two predominant privacy-preserving ML mechanisms. However,implementing FL or SL in device-to-device (D2D)-enabled heterogeneous networks with diverse clients presents substantial challenges, including architecture scalability and prolonged training delays. To address these challenges, this article introduces two innovative hybrid distributed ML architectures, namely, hybrid split FL (HSFL) and hybrid federated SL (HFSL). Such architectures combine the strengths of both FL and SL in D2D-enabled heterogeneous wireless networks. We provide a comprehensive analysis of the performance and advantages of HSFL and HFSL, while also highlighting open challenges for future exploration. We support our proposals with preliminary simulations using three datasets in non-independent and non-identically distributed settings, demonstrating the feasibility of our architectures. Our simulations reveal notable reductions in communication/computation costs and training delays as compared to conventional FL and SL.
    Towards model-free RL algorithms that scale well with unstructured data. (arXiv:2311.02215v1 [cs.LG])
    Conventional reinforcement learning (RL) algorithms exhibit broad generality in their theoretical formulation and high performance on several challenging domains when combined with powerful function approximation. However, developing RL algorithms that perform well across problems with unstructured observations at scale remains challenging because most function approximation methods rely on externally provisioned knowledge about the structure of the input for good performance (e.g. convolutional networks, graph neural networks, tile-coding). A common practice in RL is to evaluate algorithms on a single problem, or on problems with limited variation in the observation scale. RL practitioners lack a systematic way to study how well a single RL algorithm performs when instantiated across a range of problem scales, and they lack function approximation techniques that scale well with unstructured observations. We address these limitations by providing environments and algorithms to study scaling for unstructured observation vectors and flat action spaces. We introduce a family of combinatorial RL problems with an exponentially large state space and high-dimensional dynamics but where linear computation is sufficient to learn a (nonlinear) value function estimate for performant control. We provide an algorithm that constructs reward-relevant general value function (GVF) questions to find and exploit predictive structure directly from the experience stream. In an empirical evaluation of the approach on synthetic problems, we observe a sample complexity that scales linearly with the observation size. The proposed algorithm reliably outperforms a conventional deep RL algorithm on these scaling problems, and they exhibit several desirable auxiliary properties. These results suggest new algorithmic mechanisms by which algorithms can learn at scale from unstructured data.
    Machine learning the interaction network in coupled dynamical systems. (arXiv:2310.03378v2 [math.DS] UPDATED)
    The study of interacting dynamical systems continues to attract research interest in various fields of science and engineering. In a collection of interacting particles, the interaction network contains information about how various components interact with one another. Inferring the information about the interaction network from the dynamics of agents is a problem of long-standing interest. In this work, we employ a self-supervised neural network model to achieve two outcomes: to recover the interaction network and to predict the dynamics of individual agents. Both these information are inferred solely from the observed trajectory data. This work presents an application of the Neural Relational Inference model to two dynamical systems: coupled particles mediated by Hooke's law interaction and coupled phase (Kuramoto) oscillators.
    A Taxonomy of Human and ML Strengths in Decision-Making to Investigate Human-ML Complementarity. (arXiv:2204.10806v3 [cs.HC] UPDATED)
    Hybrid human-ML systems increasingly make consequential decisions in a wide range of domains. These systems are often introduced with the expectation that the combined human-ML system will achieve complementary performance, that is, the combined decision-making system will be an improvement compared with either decision-making agent in isolation. However, empirical results have been mixed, and existing research rarely articulates the sources and mechanisms by which complementary performance is expected to arise. Our goal in this work is to provide conceptual tools to advance the way researchers reason and communicate about human-ML complementarity. Drawing upon prior literature in human psychology, machine learning, and human-computer interaction, we propose a taxonomy characterizing distinct ways in which human and ML-based decision-making can differ. In doing so, we conceptually map potential mechanisms by which combining human and ML decision-making may yield complementary performance, developing a language for the research community to reason about design of hybrid systems in any decision-making domain. To illustrate how our taxonomy can be used to investigate complementarity, we provide a mathematical aggregation framework to examine enabling conditions for complementarity. Through synthetic simulations, we demonstrate how this framework can be used to explore specific aspects of our taxonomy and shed light on the optimal mechanisms for combining human-ML judgments
    Discrete neural nets and polymorphic learning. (arXiv:2308.00677v2 [cs.NE] UPDATED)
    Theorems from universal algebra such as that of Murski\u{i} from the 1970s have a striking similarity to universal approximation results for neural nets along the lines of Cybenko's from the 1980s. We consider here a discrete analogue of the classical notion of a neural net which places these results in a unified setting. We introduce a learning algorithm based on polymorphisms of relational structures and show how to use it for a classical learning task.
    Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand. (arXiv:2310.14942v2 [cs.CV] UPDATED)
    The prosperity of deep neural networks (DNNs) is largely benefited from open-source datasets, based on which users can evaluate and improve their methods. In this paper, we revisit backdoor-based dataset ownership verification (DOV), which is currently the only feasible approach to protect the copyright of open-source datasets. We reveal that these methods are fundamentally harmful given that they could introduce malicious misclassification behaviors to watermarked DNNs by the adversaries. In this paper, we design DOV from another perspective by making watermarked models (trained on the protected dataset) correctly classify some `hard' samples that will be misclassified by the benign model. Our method is inspired by the generalization property of DNNs, where we find a \emph{hardly-generalized domain} for the original dataset (as its \emph{domain watermark}). It can be easily learned with the protected dataset containing modified samples. Specifically, we formulate the domain generation as a bi-level optimization and propose to optimize a set of visually-indistinguishable clean-label modified data with similar effects to domain-watermarked samples from the hardly-generalized domain to ensure watermark stealthiness. We also design a hypothesis-test-guided ownership verification via our domain watermark and provide the theoretical analyses of our method. Extensive experiments on three benchmark datasets are conducted, which verify the effectiveness of our method and its resistance to potential adaptive methods. The code for reproducing main experiments is available at \url{https://github.com/JunfengGo/Domain-Watermark}.
    Benefits of mirror weight symmetry for 3D mesh segmentation in biomedical applications. (arXiv:2309.17076v2 [eess.IV] UPDATED)
    3D mesh segmentation is an important task with many biomedical applications. The human body has bilateral symmetry and some variations in organ positions. It allows us to expect a positive effect of rotation and inversion invariant layers in convolutional neural networks that perform biomedical segmentations. In this study, we show the impact of weight symmetry in neural networks that perform 3D mesh segmentation. We analyze the problem of 3D mesh segmentation for pathological vessel structures (aneurysms) and conventional anatomical structures (endocardium and epicardium of ventricles). Local geometrical features are encoded as sampling from the signed distance function, and the neural network performs prediction for each mesh node. We show that weight symmetry gains from 1 to 3% of additional accuracy and allows decreasing the number of trainable parameters up to 8 times without suffering the performance loss if neural networks have at least three convolutional layers. This also works for very small training sets.
    MANER: Multi-Agent Neural Rearrangement Planning of Objects in Cluttered Environments. (arXiv:2306.06543v2 [cs.RO] UPDATED)
    Object rearrangement is a fundamental problem in robotics with various practical applications ranging from managing warehouses to cleaning and organizing home kitchens. While existing research has primarily focused on single-agent solutions, real-world scenarios often require multiple robots to work together on rearrangement tasks. This paper proposes a comprehensive learning-based framework for multi-agent object rearrangement planning, addressing the challenges of task sequencing and path planning in complex environments. The proposed method iteratively selects objects, determines their relocation regions, and pairs them with available robots under kinematic feasibility and task reachability for execution to achieve the target arrangement. Our experiments on a diverse range of simulated and real-world environments demonstrate the effectiveness and robustness of the proposed framework. Furthermore, results indicate improved performance in terms of traversal time and success rate compared to baseline approaches.
    Fine-Tune Language Models as Differential Equation Solvers. (arXiv:2308.05061v2 [cs.LG] UPDATED)
    In the growing domain of scientific machine learning, in-context operator learning has shown notable potential in learning operators and solving differential equations using prompted data, during the inference stage without weight updates. However, the current model's overdependence on function data, may inadvertently overlook the invaluable human insight into the operator. To address this, we present a transformation of in-context operator learning into a multi-modal paradigm. In particular, we take inspiration from the recent success of large language models, and propose using "captions" to integrate human knowledge about the operator, expressed through natural language descriptions and equations. Also, we introduce a novel approach to train a language-model-like architecture, or directly fine-tune existing language models, for in-context operator learning. We beat the baseline on single-modal learning tasks, and also demonstrated the effectiveness of multi-modal learning in enhancing performance and reducing function data requirements. The proposed method not only significantly improves in-context operator learning, but also creates a new path for the application of language models.
    Sampling via Gradient Flows in the Space of Probability Measures. (arXiv:2310.03597v2 [stat.ML] UPDATED)
    Sampling a target probability distribution with an unknown normalization constant is a fundamental challenge in computational science and engineering. Recent work shows that algorithms derived by considering gradient flows in the space of probability measures open up new avenues for algorithm development. This paper makes three contributions to this sampling approach by scrutinizing the design components of such gradient flows. Any instantiation of a gradient flow for sampling needs an energy functional and a metric to determine the flow, as well as numerical approximations of the flow to derive algorithms. Our first contribution is to show that the Kullback-Leibler divergence, as an energy functional, has the unique property (among all f-divergences) that gradient flows resulting from it do not depend on the normalization constant of the target distribution. Our second contribution is to study the choice of metric from the perspective of invariance. The Fisher-Rao metric is known as the unique choice (up to scaling) that is diffeomorphism invariant. As a computationally tractable alternative, we introduce a relaxed, affine invariance property for the metrics and gradient flows. In particular, we construct various affine invariant Wasserstein and Stein gradient flows. Affine invariant gradient flows are shown to behave more favorably than their non-affine-invariant counterparts when sampling highly anisotropic distributions, in theory and by using particle methods. Our third contribution is to study, and develop efficient algorithms based on Gaussian approximations of the gradient flows; this leads to an alternative to particle methods. We establish connections between various Gaussian approximate gradient flows, discuss their relation to gradient methods arising from parametric variational inference, and study their convergence properties both theoretically and numerically.
    The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI. (arXiv:2310.16787v3 [cs.CL] UPDATED)
    The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data. This points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: www.dataprovenance.org.
    RADIO: Reference-Agnostic Dubbing Video Synthesis. (arXiv:2309.01950v2 [cs.CV] UPDATED)
    One of the most challenging problems in audio-driven talking head generation is achieving high-fidelity detail while ensuring precise synchronization. Given only a single reference image, extracting meaningful identity attributes becomes even more challenging, often causing the network to mirror the facial and lip structures too closely. To address these issues, we introduce RADIO, a framework engineered to yield high-quality dubbed videos regardless of the pose or expression in reference images. The key is to modulate the decoder layers using latent space composed of audio and reference features. Additionally, we incorporate ViT blocks into the decoder to emphasize high-fidelity details, especially in the lip region. Our experimental results demonstrate that RADIO displays high synchronization without the loss of fidelity. Especially in harsh scenarios where the reference frame deviates significantly from the ground truth, our method outperforms state-of-the-art methods, highlighting its robustness.
    Data Filtering Networks. (arXiv:2309.17425v3 [cs.AI] UPDATED)
    Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art CLIP models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 84.4% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data.
    Feature selection and regression methods for stock price prediction using technical indicators. (arXiv:2310.09903v4 [q-fin.ST] UPDATED)
    Due to the influence of many factors, including technical indicators on stock price prediction, feature selection is important to choose the best indicators. This study uses technical indicators and features selection and regression methods to solve the problem of closing the stock market price. The aim of this research is to predict the stock market price with the least error. By the proposed method, the data created by the 3-day time window were converted to the appropriate input for regression methods. In this paper, 10 regressor and 123 technical indicators have been examined on data of the last 13 years of Apple Company. The results have been investigated by 5 error-based evaluation criteria. Based on results of the proposed method, MLPSF has 56/47% better performance than MLP. Also, SVRSF has 67/42% improved compared to SVR. LRSF was 76.7 % improved compared to LR. The RISF method also improved 72.82 % of Ridge regression. The DTRSB method had 24.23 % improvement over DTR. KNNSB had 15.52 % improvement over KNN regression. RFSB had a 6 % improvement over RF. GBRSF also improved at 7% over GBR. Finally, ADASF and ADASB also had a 4% improvement over the ADA regression. Also, Ridge and LinearRegression had the best results for stock price prediction. Based on results, the best indicators to predict stock price are: the Squeeze_pro, Percentage Price Oscillator, Thermo, Decay, Archer On-Balance Volume, Bollinger Bands, Squeeze and Ichimoku indicator. According to the results, the use of suitable combination of suggested indicators along with regression methods has resulted in high accuracy in predicting the closing price.
    GInX-Eval: Towards In-Distribution Evaluation of Graph Neural Network Explanations. (arXiv:2309.16223v2 [cs.AI] UPDATED)
    Diverse explainability methods of graph neural networks (GNN) have recently been developed to highlight the edges and nodes in the graph that contribute the most to the model predictions. However, it is not clear yet how to evaluate the correctness of those explanations, whether it is from a human or a model perspective. One unaddressed bottleneck in the current evaluation procedure is the problem of out-of-distribution explanations, whose distribution differs from those of the training data. This important issue affects existing evaluation metrics such as the popular faithfulness or fidelity score. In this paper, we show the limitations of faithfulness metrics. We propose GInX-Eval (Graph In-distribution eXplanation Evaluation), an evaluation procedure of graph explanations that overcomes the pitfalls of faithfulness and offers new insights on explainability methods. Using a fine-tuning strategy, the GInX score measures how informative removed edges are for the model and the EdgeRank score evaluates if explanatory edges are correctly ordered by their importance. GInX-Eval verifies if ground-truth explanations are instructive to the GNN model. In addition, it shows that many popular methods, including gradient-based methods, produce explanations that are not better than a random designation of edges as important subgraphs, challenging the findings of current works in the area. Results with GInX-Eval are consistent across multiple datasets and align with human evaluation.
    Non-parametric Conditional Independence Testing for Mixed Continuous-Categorical Variables: A Novel Method and Numerical Evaluation. (arXiv:2310.11132v2 [cs.LG] UPDATED)
    Conditional independence testing (CIT) is a common task in machine learning, e.g., for variable selection, and a main component of constraint-based causal discovery. While most current CIT approaches assume that all variables are numerical or all variables are categorical, many real-world applications involve mixed-type datasets that include numerical and categorical variables. Non-parametric CIT can be conducted using conditional mutual information (CMI) estimators combined with a local permutation scheme. Recently, two novel CMI estimators for mixed-type datasets based on k-nearest-neighbors (k-NN) have been proposed. As with any k-NN method, these estimators rely on the definition of a distance metric. One approach computes distances by a one-hot encoding of the categorical variables, essentially treating categorical variables as discrete-numerical, while the other expresses CMI by entropy terms where the categorical variables appear as conditions only. In this work, we study these estimators and propose a variation of the former approach that does not treat categorical variables as numeric. Our numerical experiments show that our variant detects dependencies more robustly across different data distributions and preprocessing types.
    Efficient Robust Bayesian Optimization for Arbitrary Uncertain Inputs. (arXiv:2310.20145v2 [cs.LG] UPDATED)
    Bayesian Optimization (BO) is a sample-efficient optimization algorithm widely employed across various applications. In some challenging BO tasks, input uncertainty arises due to the inevitable randomness in the optimization process, such as machining errors, execution noise, or contextual variability. This uncertainty deviates the input from the intended value before evaluation, resulting in significant performance fluctuations in the final result. In this paper, we introduce a novel robust Bayesian Optimization algorithm, AIRBO, which can effectively identify a robust optimum that performs consistently well under arbitrary input uncertainty. Our method directly models the uncertain inputs of arbitrary distributions by empowering the Gaussian Process with the Maximum Mean Discrepancy (MMD) and further accelerates the posterior inference via Nystrom approximation. Rigorous theoretical regret bound is established under MMD estimation error and extensive experiments on synthetic functions and real problems demonstrate that our approach can handle various input uncertainties and achieve state-of-the-art performance.
    Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free Continual Learning. (arXiv:2308.09544v3 [cs.LG] UPDATED)
    In this work, we investigate exemplar-free class incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy, aiming to prevent forgetting. KD-based methods are successfully used in CIL, but they often struggle to regularize the model without access to exemplars of the training data from previous tasks. Our analysis reveals that this issue originates from substantial representation shifts in the teacher network when dealing with out-of-distribution data. This causes large errors in the KD loss component, leading to performance degradation in CIL models. Inspired by recent test-time adaptation methods, we introduce Teacher Adaptation (TA), a method that concurrently updates the teacher and the main models during incremental training. Our method seamlessly integrates with KD-based CIL approaches and allows for consistent enhancement of their performance across multiple exemplar-free CIL benchmarks. The source code for our method is available at https://github.com/fszatkowski/cl-teacher-adaptation.
    AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents. (arXiv:2310.09971v2 [cs.LG] UPDATED)
    We introduce AMAGO, an in-context Reinforcement Learning (RL) agent that uses sequence models to tackle the challenges of generalization, long-term memory, and meta-learning. Recent works have shown that off-policy learning can make in-context RL with recurrent policies viable. Nonetheless, these approaches require extensive tuning and limit scalability by creating key bottlenecks in agents' memory capacity, planning horizon, and model size. AMAGO revisits and redesigns the off-policy in-context approach to successfully train long-sequence Transformers over entire rollouts in parallel with end-to-end RL. Our agent is uniquely scalable and applicable to a wide range of problems. We demonstrate its strong performance empirically in meta-RL and long-term memory domains. AMAGO's focus on sparse rewards and off-policy data also allows in-context learning to extend to goal-conditioned problems with challenging exploration. When combined with a novel hindsight relabeling scheme, AMAGO can solve a previously difficult category of open-world domains, where agents complete many possible instructions in procedurally generated environments. We evaluate our agent on three goal-conditioned domains and study how its individual improvements connect to create a generalist policy.
    On existence, uniqueness and scalability of adversarial robustness measures for AI classifiers. (arXiv:2310.14421v2 [stat.ML] UPDATED)
    Simply-verifiable mathematical conditions for existence, uniqueness and explicit analytical computation of minimal adversarial paths (MAP) and minimal adversarial distances (MAD) for (locally) uniquely-invertible classifiers, for generalized linear models (GLM), and for entropic AI (EAI) are formulated and proven. Practical computation of MAP and MAD, their comparison and interpretations for various classes of AI tools (for neuronal networks, boosted random forests, GLM and EAI) are demonstrated on the common synthetic benchmarks: on a double Swiss roll spiral and its extensions, as well as on the two biomedical data problems (for the health insurance claim predictions, and for the heart attack lethality classification). On biomedical applications it is demonstrated how MAP provides unique minimal patient-specific risk-mitigating interventions in the predefined subsets of accessible control variables.
    Early detection of inflammatory arthritis to improve referrals using multimodal machine learning from blood testing, semi-structured and unstructured patient records. (arXiv:2310.19967v2 [cs.LG] UPDATED)
    Early detection of inflammatory arthritis (IA) is critical to efficient and accurate hospital referral triage for timely treatment and preventing the deterioration of the IA disease course, especially under limited healthcare resources. The manual assessment process is the most common approach in practice for the early detection of IA, but it is extremely labor-intensive and inefficient. A large amount of clinical information needs to be assessed for every referral from General Practice (GP) to the hospitals. Machine learning shows great potential in automating repetitive assessment tasks and providing decision support for the early detection of IA. However, most machine learning-based methods for IA detection rely on blood testing results. But in practice, blood testing data is not always available at the point of referrals, so we need methods to leverage multimodal data such as semi-structured and unstructured data for early detection of IA. In this research, we present fusion and ensemble learning-based methods using multimodal data to assist decision-making in the early detection of IA, and a conformal prediction-based method to quantify the uncertainty of the prediction and detect any unreliable predictions. To the best of our knowledge, our study is the first attempt to utilize multimodal data to support the early detection of IA from GP referrals.
    De Novo Drug Design with Joint Transformers. (arXiv:2310.02066v2 [cs.LG] UPDATED)
    De novo drug design requires simultaneously generating novel molecules outside of training data and predicting their target properties, making it a hard task for generative models. To address this, we propose Joint Transformer that combines a Transformer decoder, a Transformer encoder, and a predictor in a joint generative model with shared weights. We show that training the model with a penalized log-likelihood objective results in state-of-the-art performance in molecule generation, while decreasing the prediction error on newly sampled molecules, as compared to a fine-tuned decoder-only Transformer, by 42%. Finally, we propose a probabilistic black-box optimization algorithm that employs Joint Transformer to generate novel molecules with improved target properties, as compared to the training data, outperforming other SMILES-based optimization methods in de novo drug design.
    Causality and Independence Enhancement for Biased Node Classification. (arXiv:2310.09586v2 [cs.LG] UPDATED)
    Most existing methods that address out-of-distribution (OOD) generalization for node classification on graphs primarily focus on a specific type of data biases, such as label selection bias or structural bias. However, anticipating the type of bias in advance is extremely challenging, and designing models solely for one specific type may not necessarily improve overall generalization performance. Moreover, limited research has focused on the impact of mixed biases, which are more prevalent and demanding in real-world scenarios. To address these limitations, we propose a novel Causality and Independence Enhancement (CIE) framework, applicable to various graph neural networks (GNNs). Our approach estimates causal and spurious features at the node representation level and mitigates the influence of spurious correlations through the backdoor adjustment. Meanwhile, independence constraint is introduced to improve the discriminability and stability of causal and spurious features in complex biased environments. Essentially, CIE eliminates different types of data biases from a unified perspective, without the need to design separate methods for each bias as before. To evaluate the performance under specific types of data biases, mixed biases, and low-resource scenarios, we conducted comprehensive experiments on five publicly available datasets. Experimental results demonstrate that our approach CIE not only significantly enhances the performance of GNNs but outperforms state-of-the-art debiased node classification methods.
    Online covariance estimation for stochastic gradient descent under Markovian sampling. (arXiv:2308.01481v2 [math.ST] UPDATED)
    We investigate the online overlapping batch-means covariance estimator for Stochastic Gradient Descent (SGD) under Markovian sampling. Convergence rates of order $O\big(\sqrt{d}\,n^{-1/8}(\log n)^{1/4}\big)$ and $O\big(\sqrt{d}\,n^{-1/8}\big)$ are established under state-dependent and state-independent Markovian sampling, respectively, where $d$ is the dimensionality and $n$ denotes observations or SGD iterations. These rates match the best-known convergence rate for independent and identically distributed (i.i.d) data. Our analysis overcomes significant challenges that arise due to Markovian sampling, leading to the introduction of additional error terms and complex dependencies between the blocks of the batch-means covariance estimator. Moreover, we establish the convergence rate for the first four moments of the $\ell_2$ norm of the error of SGD dynamics under state-dependent Markovian data, which holds potential interest as an independent result. Numerical illustrations provide confidence intervals for SGD in linear and logistic regression models under Markovian sampling. Additionally, our method is applied to the strategic classification with logistic regression, where adversaries adaptively modify features during training to affect target class classification.
    PRE: Vision-Language Prompt Learning with Reparameterization Encoder. (arXiv:2309.07760v2 [cs.CV] UPDATED)
    Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, to attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the major challenge for deploying such models in practice since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens. While CoOp can achieve substantial improvements over manual prompts, its learned context is worse generalizable to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, PRE achieves a notable enhancement of 5.60% in average accuracy on New classes and 3% in Harmonic mean compared to CoOp in the 16-shot setting, all achieved within a good training time.
    AI Increases Global Access to Reliable Flood Forecasts. (arXiv:2307.16104v4 [cs.LG] UPDATED)
    Floods are one of the most common natural disasters, with a disproportionate impact in developing countries that often lack dense streamflow gauge networks. Accurate and timely warnings are critical for mitigating flood risks, but hydrological simulation models typically must be calibrated to long data records in each watershed. Using AI, we achieve reliability in predicting extreme riverine events in ungauged watersheds at up to a 5-day lead time that is similar to or better than the reliability of nowcasts (0-day lead time) from a current state of the art global modeling system (the Copernicus Emergency Management Service Global Flood Awareness System). Additionally, we achieve accuracies over 5-year return period events that are similar to or better than current accuracies over 1-year return period events. This means that AI can provide flood warnings earlier and over larger and more impactful events in ungauged basins. The model developed in this paper was incorporated into an operational early warning system that produces publicly available (free and open) forecasts in real time in over 80 countries. This work highlights a need for increasing the availability of hydrological data to continue to improve global access to reliable flood warnings.
    SBSM-Pro: Support Bio-sequence Machine for Proteins. (arXiv:2308.10275v2 [q-bio.QM] UPDATED)
    Proteins play a pivotal role in biological systems. The use of machine learning algorithms for protein classification can assist and even guide biological experiments, offering crucial insights for biotechnological applications. We introduce the Support Bio-Sequence Machine for Proteins (SBSM-Pro), a model purpose-built for the classification of biological sequences. This model starts with raw sequences and groups amino acids based on their physicochemical properties. It incorporates sequence alignment to measure the similarities between proteins and uses a novel multiple kernel learning (MKL) approach to integrate various types of information, utilizing support vector machines for classification prediction. The results indicate that our model demonstrates commendable performance across ten datasets in terms of the identification of protein function and posttranslational modification. This research not only exemplifies state-of-the-art work in protein classification but also paves avenues for new directions in this domain, representing a beneficial endeavor in the development of platforms tailored for the classification of biological sequences. SBSM-Pro is available for access at this http URL
    DeepACO: Neural-enhanced Ant Systems for Combinatorial Optimization. (arXiv:2309.14032v2 [cs.NE] UPDATED)
    Ant Colony Optimization (ACO) is a meta-heuristic algorithm that has been successfully applied to various Combinatorial Optimization Problems (COPs). Traditionally, customizing ACO for a specific problem requires the expert design of knowledge-driven heuristics. In this paper, we propose DeepACO, a generic framework that leverages deep reinforcement learning to automate heuristic designs. DeepACO serves to strengthen the heuristic measures of existing ACO algorithms and dispense with laborious manual design in future ACO applications. As a neural-enhanced meta-heuristic, DeepACO consistently outperforms its ACO counterparts on eight COPs using a single neural architecture and a single set of hyperparameters. As a Neural Combinatorial Optimization method, DeepACO performs better than or on par with problem-specific methods on canonical routing problems. Our code is publicly available at https://github.com/henry-yeh/DeepACO.
    Towards Robust Cardiac Segmentation using Graph Convolutional Networks. (arXiv:2310.01210v4 [eess.IV] UPDATED)
    Fully automatic cardiac segmentation can be a fast and reproducible method to extract clinical measurements from an echocardiography examination. The U-Net architecture is the current state-of-the-art deep learning architecture for medical segmentation and can segment cardiac structures in real-time with average errors comparable to inter-observer variability. However, this architecture still generates large outliers that are often anatomically incorrect. This work uses the concept of graph convolutional neural networks that predict the contour points of the structures of interest instead of labeling each pixel. We propose a graph architecture that uses two convolutional rings based on cardiac anatomy and show that this eliminates anatomical incorrect multi-structure segmentations on the publicly available CAMUS dataset. Additionally, this work contributes with an ablation study on the graph convolutional architecture and an evaluation of clinical measurements on the clinical HUNT4 dataset. Finally, we propose to use the inter-model agreement of the U-Net and the graph network as a predictor of both the input and segmentation quality. We show this predictor can detect out-of-distribution and unsuitable input images in real-time. Source code is available online: https://github.com/gillesvntnu/GCN_multistructure
    Active Learning for Semantic Segmentation with Multi-class Label Query. (arXiv:2309.09319v2 [cs.CV] UPDATED)
    This paper proposes a new active learning method for semantic segmentation. The core of our method lies in a new annotation query design. It samples informative local image regions (e.g., superpixels), and for each of such regions, asks an oracle for a multi-hot vector indicating all classes existing in the region. This multi-class labeling strategy is substantially more efficient than existing ones like segmentation, polygon, and even dominant class labeling in terms of annotation time per click. However, it introduces the class ambiguity issue in training as it assigns partial labels (i.e., a set of candidate classes) to individual pixels. We thus propose a new algorithm for learning semantic segmentation while disambiguating the partial labels in two stages. In the first stage, it trains a segmentation model directly with the partial labels through two new loss functions motivated by partial label learning and multiple instance learning. In the second stage, it disambiguates the partial labels by generating pixel-wise pseudo labels, which are used for supervised learning of the model. Equipped with a new acquisition function dedicated to the multi-class labeling, our method outperforms previous work on Cityscapes and PASCAL VOC 2012 while spending less annotation cost. Our code and results are available at https://github.com/sehyun03/MulActSeg.
    MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks. (arXiv:2309.14118v2 [cs.LG] UPDATED)
    Predicting multiple real-world tasks in a single model often requires a particularly diverse feature space. Multimodal (MM) models aim to extract the synergistic predictive potential of multiple data types to create a shared feature space with aligned semantic meaning across inputs of drastically varying sizes (i.e. images, text, sound). Most current MM architectures fuse these representations in parallel, which not only limits their interpretability but also creates a dependency on modality availability. We present MultiModN, a multimodal, modular network that fuses latent representations in a sequence of any number, combination, or type of modality while providing granular real-time predictive feedback on any number or combination of predictive tasks. MultiModN's composable pipeline is interpretable-by-design, as well as innately multi-task and robust to the fundamental issue of biased missingness. We perform four experiments on several benchmark MM datasets across 10 real-world tasks (predicting medical diagnoses, academic performance, and weather), and show that MultiModN's sequential MM fusion does not compromise performance compared with a baseline of parallel fusion. By simulating the challenging bias of missing not-at-random (MNAR), this work shows that, contrary to MultiModN, parallel fusion baselines erroneously learn MNAR and suffer catastrophic failure when faced with different patterns of MNAR at inference. To the best of our knowledge, this is the first inherently MNAR-resistant approach to MM modeling. In conclusion, MultiModN provides granular insights, robustness, and flexibility without compromising performance.
    RTDK-BO: High Dimensional Bayesian Optimization with Reinforced Transformer Deep kernels. (arXiv:2310.03912v4 [cs.LG] UPDATED)
    Bayesian Optimization (BO), guided by Gaussian process (GP) surrogates, has proven to be an invaluable technique for efficient, high-dimensional, black-box optimization, a critical problem inherent to many applications such as industrial design and scientific computing. Recent contributions have introduced reinforcement learning (RL) to improve the optimization performance on both single function optimization and \textit{few-shot} multi-objective optimization. However, even few-shot techniques fail to exploit similarities shared between closely related objectives. In this paper, we combine recent developments in Deep Kernel Learning (DKL) and attention-based Transformer models to improve the modeling powers of GP surrogates with meta-learning. We propose a novel method for improving meta-learning BO surrogates by incorporating attention mechanisms into DKL, empowering the surrogates to adapt to contextual information gathered during the BO process. We combine this Transformer Deep Kernel with a learned acquisition function trained with continuous Soft Actor-Critic Reinforcement Learning to aid in exploration. This Reinforced Transformer Deep Kernel (RTDK-BO) approach yields state-of-the-art results in continuous high-dimensional optimization problems.
    A Nonlinear Method for time series forecasting using VMD-GARCH-LSTM model. (arXiv:2310.08812v2 [stat.ME] UPDATED)
    Time series forecasting represents a significant and challenging task across various fields. Recently, methods based on mode decomposition have dominated the forecasting of complex time series because of the advantages of capturing local characteristics and extracting intrinsic modes from data. Unfortunately, most models fail to capture the implied volatilities that contain significant information. To enhance the forecasting of current, rapidly evolving, and volatile time series, we propose a novel decomposition-ensemble paradigm, the VMD-LSTM-GARCH model. The Variational Mode Decomposition algorithm is employed to decompose the time series into K sub-modes. Subsequently, the GARCH model extracts the volatility information from these sub-modes, which serve as the input for the LSTM. The numerical and volatility information of each sub-mode is utilized to train a Long Short-Term Memory network. This network predicts the sub-mode, and then we aggregate the predictions from all sub-modes to produce the output. By integrating econometric and artificial intelligence methods, and taking into account both the numerical and volatility information of the time series, our proposed model demonstrates superior performance in time series forecasting, as evidenced by the significant decrease in MSE, RMSE, and MAPE in our comparative experimental results.
    An Online Multiple Kernel Parallelizable Learning Scheme. (arXiv:2308.10101v2 [cs.LG] UPDATED)
    The performance of reproducing kernel Hilbert space-based methods is known to be sensitive to the choice of the reproducing kernel. Choosing an adequate reproducing kernel can be challenging and computationally demanding, especially in data-rich tasks without prior information about the solution domain. In this paper, we propose a learning scheme that scalably combines several single kernel-based online methods to reduce the kernel-selection bias. The proposed learning scheme applies to any task formulated as a regularized empirical risk minimization convex problem. More specifically, our learning scheme is based on a multi-kernel learning formulation that can be applied to widen any single-kernel solution space, thus increasing the possibility of finding higher-performance solutions. In addition, it is parallelizable, allowing for the distribution of the computational load across different computing units. We show experimentally that the proposed learning scheme outperforms the combined single-kernel online methods separately in terms of the cumulative regularized least squares cost metric.
    Optimal data pooling for shared learning in maintenance operations. (arXiv:2308.12670v2 [cs.LG] UPDATED)
    We study optimal data pooling for shared learning in two common maintenance operations: condition-based maintenance and spare parts management. We consider a set of systems subject to Poisson input -- the degradation or demand process -- that are coupled through an a-priori unknown rate. Decision problems involving these systems are high-dimensional Markov decision processes (MDPs) and hence notoriously difficult to solve. We present a decomposition result that reduces such an MDP to two-dimensional MDPs, enabling structural analyses and computations. Leveraging this decomposition, we (i) demonstrate that pooling data can lead to significant cost reductions compared to not pooling, and (ii) show that the optimal policy for the condition-based maintenance problem is a control limit policy, while for the spare parts management problem, it is an order-up-to level policy, both dependent on the pooled data.
    Spatial-frequency channels, shape bias, and adversarial robustness. (arXiv:2309.13190v2 [cs.LG] UPDATED)
    What spatial frequency information do humans and neural networks use to recognize objects? In neuroscience, critical band masking is an established tool that can reveal the frequency-selective filters used for object recognition. Critical band masking measures the sensitivity of recognition performance to noise added at each spatial frequency. Existing critical band masking studies show that humans recognize periodic patterns (gratings) and letters by means of a spatial-frequency filter (or "channel") that has a frequency bandwidth of one octave (doubling of frequency). Here, we introduce critical band masking as a task for network-human comparison and test 14 humans and 76 neural networks on 16-way ImageNet categorization in the presence of narrowband noise. We find that humans recognize objects in natural images using the same one-octave-wide channel that they use for letters and gratings, making it a canonical feature of human object recognition. Unlike humans, the neural network channel is very broad, 2-4 times wider than the human channel. Thus, noise at certain high and low frequencies will impair network performance and spare human performance. Adversarial and augmented-image training are commonly used to increase network robustness and shape bias. Does this training align network and human object recognition channels? Three network channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (51% variance explained) and robustness of adversarially-trained networks (66% variance explained). Adversarial training increases robustness but expands the channel bandwidth even further beyond the human bandwidth. Thus, critical band masking reveals that the network channel is more than twice as wide as the human channel, and that adversarial training only makes it worse. Networks with narrower channels might be more robust.
    DEDUCE: Multi-head attention decoupled contrastive learning to discover cancer subtypes based on multi-omics data. (arXiv:2307.04075v2 [cs.LG] UPDATED)
    Due to the high heterogeneity and clinical characteristics of cancer, there are significant differences in multi-omics data and clinical features among subtypes of different cancers. Therefore, the identification and discovery of cancer subtypes are crucial for the diagnosis, treatment, and prognosis of cancer. In this study, we proposed a generalization framework based on attention mechanisms for unsupervised contrastive learning to analyze cancer multi-omics data for the identification and characterization of cancer subtypes. The framework contains a symmetric unsupervised multi-head attention encoder, which can deeply extract contextual features and long-range dependencies of multi-omics data, reducing the impact of noise in multi-omics data. Importantly, the proposed framework includes a decoupled contrastive learning model (DEDUCE) based on a multi-head attention mechanism to learn multi-omics data features and clustering and identify cancer subtypes. This method clusters subtypes by calculating the similarity between samples in the feature space and sample space of multi-omics data. The basic idea is to decouple different attributes of multi-omics data features and learn them as contrasting terms. Construct a contrastive loss function to measure the difference between positive examples and negative examples, and minimize this difference, thereby encouraging the model to learn better feature representation. The DEDUCE model conducts large-scale experiments on simulated multi-omics data sets, single-cell multi-omics data sets and cancer multi-omics data sets, and the results are better than 10 deep learning models. Finally, we used the DEDUCE model to reveal six cancer subtypes of AML. By analyzing GO functional enrichment, subtype-specific biological functions and GSEA of AML,
    Low Tensor Rank Learning of Neural Dynamics. (arXiv:2308.11567v2 [q-bio.NC] UPDATED)
    Learning relies on coordinated synaptic changes in recurrently connected populations of neurons. Therefore, understanding the collective evolution of synaptic connectivity over learning is a key challenge in neuroscience and machine learning. In particular, recent work has shown that the weight matrices of task-trained RNNs are typically low rank, but how this low rank structure unfolds over learning is unknown. To address this, we investigate the rank of the 3-tensor formed by the weight matrices throughout learning. By fitting RNNs of varying rank to large-scale neural recordings during a motor learning task, we find that the inferred weights are low-tensor-rank and therefore evolve over a fixed low-dimensional subspace throughout the entire course of learning. We next validate the observation of low-tensor-rank learning on an RNN trained to solve the same task. Finally, we present a set of mathematical results bounding the matrix and tensor ranks of gradient descent learning dynamics which show that low-tensor-rank weights emerge naturally in RNNs trained to solve low-dimensional tasks. Taken together, our findings provide insight on the evolution of population connectivity over learning in both biological and artificial neural networks, and enable reverse engineering of learning-induced changes in recurrent dynamics from large-scale neural recordings.
    HINT: Healthy Influential-Noise based Training to Defend against Data Poisoning Attacks. (arXiv:2309.08549v2 [cs.LG] UPDATED)
    While numerous defense methods have been proposed to prohibit potential poisoning attacks from untrusted data sources, most research works only defend against specific attacks, which leaves many avenues for an adversary to exploit. In this work, we propose an efficient and robust training approach to defend against data poisoning attacks based on influence functions, named Healthy Influential-Noise based Training. Using influence functions, we craft healthy noise that helps to harden the classification model against poisoning attacks without significantly affecting the generalization ability on test data. In addition, our method can perform effectively when only a subset of the training data is modified, instead of the current method of adding noise to all examples that has been used in several previous works. We conduct comprehensive evaluations over two image datasets with state-of-the-art poisoning attacks under different realistic attack scenarios. Our empirical results show that HINT can efficiently protect deep learning models against the effect of both untargeted and targeted poisoning attacks.
    PyDCM: Custom Data Center Models with Reinforcement Learning for Sustainability. (arXiv:2310.03906v5 [cs.LG] UPDATED)
    The increasing global emphasis on sustainability and reducing carbon emissions is pushing governments and corporations to rethink their approach to data center design and operation. Given their high energy consumption and exponentially large computational workloads, data centers are prime candidates for optimizing power consumption, especially in areas such as cooling and IT energy usage. A significant challenge in this pursuit is the lack of a configurable and scalable thermal data center model that offers an end-to-end pipeline. Data centers consist of multiple IT components whose geometric configuration and heat dissipation make thermal modeling difficult. This paper presents PyDCM, a customizable Data Center Model implemented in Python, that allows users to create unique configurations of IT equipment with custom server specifications and geometric arrangements of IT cabinets. The use of vectorized thermal calculations makes PyDCM orders of magnitude faster (30 times) than current Energy Plus modeling implementations and scales sublinearly with the number of CPUs. Also, PyDCM enables the use of Deep Reinforcement Learning via the Gymnasium wrapper to optimize data center cooling and offers a user-friendly platform for testing various data center design prototypes.
    Adaptive Linear Estimating Equations. (arXiv:2307.07320v2 [math.ST] UPDATED)
    Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering processes. Despite its advantages, such data collection mechanism often introduces complexities to the statistical inference procedure. For instance, the ordinary least squares (OLS) estimator in an adaptive linear regression model can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation. In this paper, we propose a general method for constructing debiased estimator which remedies this issue. It makes use of the idea of adaptive linear estimating equations, and we establish theoretical guarantees of asymptotic normality, supplemented by discussions on achieving near-optimal asymptotic variance. A salient feature of our estimator is that in the context of multi-armed bandits, our estimator retains the non-asymptotic performance of the least square estimator while obtaining asymptotic normality property. Consequently, this work helps connect two fruitful paradigms of adaptive inference: a) non-asymptotic inference using concentration inequalities and b) asymptotic inference via asymptotic normality.
    On the Computational Entanglement of Distant Features in Adversarial Machine Learning. (arXiv:2309.15669v3 [cs.LG] UPDATED)
    Adversarial examples in machine learning has emerged as a focal point of research due to their remarkable ability to deceive models with seemingly inconspicuous input perturbations, potentially resulting in severe consequences. In this study, we embark on a comprehensive exploration of adversarial machine learning models, shedding light on their intrinsic complexity and interpretability. Our investigation reveals intriguing links between machine learning model complexity and Einstein's theory of special relativity, all through the lens of entanglement. While our work does not primarily center on quantum entanglement, we instead define the entanglement correlations we have discovered to be computational, and demonstrate that distant feature samples can be entangled, strongly resembling entanglement correlation in the quantum realm. This revelation bestows fresh insights for understanding the phenomenon of emergent adversarial examples in modern machine learning, potentially paving the way for more robust and interpretable models in this rapidly evolving field.
    Adaptive Data Analysis in a Balanced Adversarial Model. (arXiv:2305.15452v2 [cs.LG] UPDATED)
    In adaptive data analysis, a mechanism gets $n$ i.i.d. samples from an unknown distribution $D$, and is required to provide accurate estimations to a sequence of adaptively chosen statistical queries with respect to $D$. Hardt and Ullman (FOCS 2014) and Steinke and Ullman (COLT 2015) showed that in general, it is computationally hard to answer more than $\Theta(n^2)$ adaptive queries, assuming the existence of one-way functions. However, these negative results strongly rely on an adversarial model that significantly advantages the adversarial analyst over the mechanism, as the analyst, who chooses the adaptive queries, also chooses the underlying distribution $D$. This imbalance raises questions with respect to the applicability of the obtained hardness results -- an analyst who has complete knowledge of the underlying distribution $D$ would have little need, if at all, to issue statistical queries to a mechanism which only holds a finite number of samples from $D$. We consider more restricted adversaries, called \emph{balanced}, where each such adversary consists of two separated algorithms: The \emph{sampler} who is the entity that chooses the distribution and provides the samples to the mechanism, and the \emph{analyst} who chooses the adaptive queries, but has no prior knowledge of the underlying distribution (and hence has no a priori advantage with respect to the mechanism). We improve the quality of previous lower bounds by revisiting them using an efficient \emph{balanced} adversary, under standard public-key cryptography assumptions. We show that these stronger hardness assumptions are unavoidable in the sense that any computationally bounded \emph{balanced} adversary that has the structure of all known attacks, implies the existence of public-key cryptography.
    Guiding The Last Layer in Federated Learning with Pre-Trained Models. (arXiv:2306.03937v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an emerging paradigm that allows a model to be trained across a number of participants without sharing data. Recent works have begun to consider the effects of using pre-trained models as an initialization point for existing FL algorithms; however, these approaches ignore the vast body of efficient transfer learning literature from the centralized learning setting. Here we revisit the problem of FL from a pre-trained model considered in prior work and expand it to a set of computer vision transfer learning problems. We first observe that simply fitting a linear classification head can be efficient and effective in many cases. We then show that in the FL setting, fitting a classifier using the Nearest Class Means (NCM) can be done exactly and orders of magnitude more efficiently than existing proposals, while obtaining strong performance. Finally, we demonstrate that using a two-phase approach of obtaining the classifier and then fine-tuning the model can yield rapid convergence and improved generalization in the federated setting. We demonstrate the potential our method has to reduce communication and compute costs while achieving better model performance.
    Explainable Representation Learning of Small Quantum States. (arXiv:2306.05694v3 [quant-ph] UPDATED)
    Unsupervised machine learning models build an internal representation of their training data without the need for explicit human guidance or feature engineering. This learned representation provides insights into which features of the data are relevant for the task at hand. In the context of quantum physics, training models to describe quantum states without human intervention offers a promising approach to gaining insight into how machines represent complex quantum states. The ability to interpret the learned representation may offer a new perspective on non-trivial features of quantum systems and their efficient representation. We train a generative model on two-qubit density matrices generated by a parameterized quantum circuit. In a series of computational experiments, we investigate the learned representation of the model and its internal understanding of the data. We observe that the model learns an interpretable representation which relates the quantum states to their underlying entanglement characteristics. In particular, our results demonstrate that the latent representation of the model is directly correlated with the entanglement measure concurrence. The insights from this study represent proof of concept towards interpretable machine learning of quantum states. Our approach offers insight into how machines learn to represent small-scale quantum systems autonomously.
    Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models. (arXiv:2306.09869v3 [cs.CV] UPDATED)
    Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have raised the issue that generated images sometimes cannot capture the intended semantic contents of the text prompts, which phenomenon is often called semantic misalignment. To address this, here we present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. Then, we obtain the gradient of the log posterior of context vectors, which can be updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Using extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing. Code: https://github.com/EnergyAttention/Energy-Based-CrossAttention.
    Towards Symmetry-Aware Generation of Periodic Materials. (arXiv:2307.02707v2 [cs.LG] UPDATED)
    We consider the problem of generating periodic materials with deep models. While symmetry-aware molecule generation has been studied extensively, periodic materials possess different symmetries, which have not been completely captured by existing methods. In this work, we propose SyMat, a novel material generation approach that can capture physical symmetries of periodic material structures. SyMat generates atom types and lattices of materials through generating atom type sets, lattice lengths and lattice angles with a variational auto-encoder model. In addition, SyMat employs a score-based diffusion model to generate atom coordinates of materials, in which a novel symmetry-aware probabilistic model is used in the coordinate diffusion process. We show that SyMat is theoretically invariant to all symmetry transformations on materials and demonstrate that SyMat achieves promising performance on random generation and property optimization tasks. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).
    Enabling Efficient, Reliable Real-World Reinforcement Learning with Approximate Physics-Based Models. (arXiv:2307.08168v2 [cs.LG] UPDATED)
    We focus on developing efficient and reliable policy optimization strategies for robot learning with real-world data. In recent years, policy gradient methods have emerged as a promising paradigm for training control policies in simulation. However, these approaches often remain too data inefficient or unreliable to train on real robotic hardware. In this paper we introduce a novel policy gradient-based policy optimization framework which systematically leverages a (possibly highly simplified) first-principles model and enables learning precise control policies with limited amounts of real-world data. Our approach $1)$ uses the derivatives of the model to produce sample-efficient estimates of the policy gradient and $2)$ uses the model to design a low-level tracking controller, which is embedded in the policy class. Theoretical analysis provides insight into how the presence of this feedback controller overcomes key limitations of stand-alone policy gradient methods, while hardware experiments with a small car and quadruped demonstrate that our approach can learn precise control strategies reliably and with only minutes of real-world data.
    Flooding with Absorption: An Efficient Protocol for Heterogeneous Bandits over Complex Networks. (arXiv:2303.05445v3 [cs.LG] UPDATED)
    Multi-armed bandits are extensively used to model sequential decision-making, making them ubiquitous in many real-life applications such as online recommender systems and wireless networking. We consider a multi-agent setting where each agent solves their own bandit instance endowed with a different set of arms. Their goal is to minimize their group regret while collaborating via some communication protocol over a given network. Previous literature on this problem only considered arm heterogeneity and networked agents separately. In this work, we introduce a setting that encompasses both features. For this novel setting, we first provide a rigorous regret analysis for a standard flooding protocol combined with the classic UCB policy. Then, to mitigate the issue of high communication costs incurred by flooding in complex networks, we propose a new protocol called Flooding with Absorption (FwA). We provide a theoretical analysis of the resulting regret bound and discuss the advantages of using FwA over flooding. Lastly, we experimentally verify on various scenarios, including dynamic networks, that FwA leads to significantly lower communication costs despite minimal regret performance loss compared to other network protocols.
    For SALE: State-Action Representation Learning for Deep Reinforcement Learning. (arXiv:2306.02451v2 [cs.LG] UPDATED)
    In the field of reinforcement learning (RL), representation learning is a proven tool for complex image-based tasks, but is often overlooked for environments with low-level states, such as physical control problems. This paper introduces SALE, a novel approach for learning embeddings that model the nuanced interaction between state and action, enabling effective representation learning from low-level states. We extensively study the design space of these embeddings and highlight important design considerations. We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm, which significantly outperforms existing continuous control algorithms. On OpenAI gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively, and works in both the online and offline settings.
    A Theory for Emergence of Complex Skills in Language Models. (arXiv:2307.15936v2 [cs.LG] UPDATED)
    A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework. Contributions include: (a) A statistical framework that relates cross-entropy loss of LLMs to competence on the basic skills that underlie language tasks. (b) Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias that allows the pre-trained model to learn very efficiently. We informally call this {\em slingshot generalization} since naively viewed it appears to give competence levels at skills that violate usual generalization theory. (c) A key example of slingshot generalization, that competence at executing tasks involving $k$-tuples of skills emerges essentially at the same scaling and same rate as competence on the elementary skills themselves.
    PGraphDTA: Improving Drug Target Interaction Prediction using Protein Language Models and Contact Maps. (arXiv:2310.04017v2 [cs.LG] UPDATED)
    Developing and discovering new drugs is a complex and resource-intensive endeavor that often involves substantial costs, time investment, and safety concerns. A key aspect of drug discovery involves identifying novel drug-target (DT) interactions. Existing computational methods for predicting DT interactions have primarily focused on binary classification tasks, aiming to determine whether a DT pair interacts or not. However, protein-ligand interactions exhibit a continuum of binding strengths, known as binding affinity, presenting a persistent challenge for accurate prediction. In this study, we investigate various techniques employed in Drug Target Interaction (DTI) prediction and propose novel enhancements to enhance their performance. Our approaches include the integration of Protein Language Models (PLMs) and the incorporation of Contact Map information as an inductive bias within current models. Through extensive experimentation, we demonstrate that our proposed approaches outperform the baseline models considered in this study, presenting a compelling case for further development in this direction. We anticipate that the insights gained from this work will significantly narrow the search space for potential drugs targeting specific proteins, thereby accelerating drug discovery. Code and data for PGraphDTA are available at https://github.com/Yijia-Xiao/PgraphDTA/.
    HACMan: Learning Hybrid Actor-Critic Maps for 6D Non-Prehensile Manipulation. (arXiv:2305.03942v4 [cs.RO] UPDATED)
    Manipulating objects without grasping them is an essential component of human dexterity, referred to as non-prehensile manipulation. Non-prehensile manipulation may enable more complex interactions with the objects, but also presents challenges in reasoning about gripper-object interactions. In this work, we introduce Hybrid Actor-Critic Maps for Manipulation (HACMan), a reinforcement learning approach for 6D non-prehensile manipulation of objects using point cloud observations. HACMan proposes a temporally-abstracted and spatially-grounded object-centric action representation that consists of selecting a contact location from the object point cloud and a set of motion parameters describing how the robot will move after making contact. We modify an existing off-policy RL algorithm to learn in this hybrid discrete-continuous action representation. We evaluate HACMan on a 6D object pose alignment task in both simulation and in the real world. On the hardest version of our task, with randomized initial poses, randomized 6D goals, and diverse object categories, our policy demonstrates strong generalization to unseen object categories without a performance drop, achieving an 89% success rate on unseen objects in simulation and 50% success rate with zero-shot transfer in the real world. Compared to alternative action representations, HACMan achieves a success rate more than three times higher than the best baseline. With zero-shot sim2real transfer, our policy can successfully manipulate unseen objects in the real world for challenging non-planar goals, using dynamic and contact-rich non-prehensile skills. Videos can be found on the project website: https://hacman-2023.github.io.
    Finding Counterfactually Optimal Action Sequences in Continuous State Spaces. (arXiv:2306.03929v2 [cs.LG] UPDATED)
    Whenever a clinician reflects on the efficacy of a sequence of treatment decisions for a patient, they may try to identify critical time steps where, had they made different decisions, the patient's health would have improved. While recent methods at the intersection of causal inference and reinforcement learning promise to aid human experts, as the clinician above, to retrospectively analyze sequential decision making processes, they have focused on environments with finitely many discrete states. However, in many practical applications, the state of the environment is inherently continuous in nature. In this paper, we aim to fill this gap. We start by formally characterizing a sequence of discrete actions and continuous states using finite horizon Markov decision processes and a broad class of bijective structural causal models. Building upon this characterization, we formalize the problem of finding counterfactually optimal action sequences and show that, in general, we cannot expect to solve it in polynomial time. Then, we develop a search method based on the $A^*$ algorithm that, under a natural form of Lipschitz continuity of the environment's dynamics, is guaranteed to return the optimal solution to the problem. Experiments on real clinical data show that our method is very efficient in practice, and it has the potential to offer interesting insights for sequential decision making tasks.
    Learning Large Graph Property Prediction via Graph Segment Training. (arXiv:2305.12322v3 [cs.LG] UPDATED)
    Learning to predict properties of large graphs is challenging because each prediction requires the knowledge of an entire graph, while the amount of memory available during training is bounded. Here we propose Graph Segment Training (GST), a general framework that utilizes a divide-and-conquer approach to allow learning large graph property prediction with a constant memory footprint. GST first divides a large graph into segments and then backpropagates through only a few segments sampled per training iteration. We refine the GST paradigm by introducing a historical embedding table to efficiently obtain embeddings for segments not sampled for backpropagation. To mitigate the staleness of historical embeddings, we design two novel techniques. First, we finetune the prediction head to fix the input distribution shift. Second, we introduce Stale Embedding Dropout to drop some stale embeddings during training to reduce bias. We evaluate our complete method GST-EFD (with all the techniques together) on two large graph property prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST-EFD is both memory-efficient and fast, while offering a slight boost on test accuracy over a typical full graph training regime.
    Active Vision Reinforcement Learning under Limited Visual Observability. (arXiv:2306.00975v2 [cs.LG] UPDATED)
    In this work, we investigate Active Vision Reinforcement Learning (ActiveVision-RL), where an embodied agent simultaneously learns action policy for the task while also controlling its visual observations in partially observable environments. We denote the former as motor policy and the latter as sensory policy. For example, humans solve real world tasks by hand manipulation (motor policy) together with eye movements (sensory policy). ActiveVision-RL poses challenges on coordinating two policies given their mutual influence. We propose SUGARL, Sensorimotor Understanding Guided Active Reinforcement Learning, a framework that models motor and sensory policies separately, but jointly learns them using with an intrinsic sensorimotor reward. This learnable reward is assigned by sensorimotor reward module, incentivizes the sensory policy to select observations that are optimal to infer its own motor action, inspired by the sensorimotor stage of humans. Through a series of experiments, we show the effectiveness of our method across a range of observability conditions and its adaptability to existed RL algorithms. The sensory policies learned through our method are observed to exhibit effective active vision strategies.
    Uncertainty Quantification via Neural Posterior Principal Components. (arXiv:2309.15533v2 [cs.CV] UPDATED)
    Uncertainty quantification is crucial for the deployment of image restoration models in safety-critical domains, like autonomous driving and biological imaging. To date, methods for uncertainty visualization have mainly focused on per-pixel estimates. Yet, a heatmap of per-pixel variances is typically of little practical use, as it does not capture the strong correlations between pixels. A more natural measure of uncertainty corresponds to the variances along the principal components (PCs) of the posterior distribution. Theoretically, the PCs can be computed by applying PCA on samples generated from a conditional generative model for the input image. However, this requires generating a very large number of samples at test time, which is painfully slow with the current state-of-the-art (diffusion) models. In this work, we present a method for predicting the PCs of the posterior distribution for any input image, in a single forward pass of a neural network. Our method can either wrap around a pre-trained model that was trained to minimize the mean square error (MSE), or can be trained from scratch to output both a predicted image and the posterior PCs. We showcase our method on multiple inverse problems in imaging, including denoising, inpainting, super-resolution, and biological image-to-image translation. Our method reliably conveys instance-adaptive uncertainty directions, achieving uncertainty quantification comparable with posterior samplers while being orders of magnitude faster. Code and examples are available at https://eliasnehme.github.io/NPPC/
    DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training. (arXiv:2310.02025v2 [cs.LG] UPDATED)
    Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems when first-order (FO) information is difficult or impossible to obtain. However, the scalability of ZO optimization remains an open problem: Its use has primarily been limited to relatively small-scale ML problems, such as sample-wise adversarial attack generation. To our best knowledge, no prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. To overcome this roadblock, we develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch through three primary innovations. First, we demonstrate the advantages of coordinate-wise gradient estimation (CGE) over randomized vector-wise gradient estimation in training accuracy and computational efficiency. Second, we propose a sparsity-induced ZO training protocol that extends the model pruning methodology using only finite differences to explore and exploit the sparse DL prior in CGE. Third, we develop the methods of feature reuse and forward parallelization to advance the practical implementations of ZO training. Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time. Furthermore, we show the practical utility of DeepZero in applications of certified adversarial defense and DL-based partial differential equation error correction, achieving 10-20% improvement over SOTA. We believe our results will inspire future research on scalable ZO optimization and contribute to advancing DL with black box.
    ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models. (arXiv:2310.18208v2 [cs.CL] UPDATED)
    Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type and incur large run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes a new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark. Our code is available at https://github.com/penfever/ArcheType.
    A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence. (arXiv:2301.13139v3 [stat.ML] UPDATED)
    Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample complexity when using shallow neural networks, show that it represents an improvement upon the previous best results, and empirically validate the effectiveness of our theoretical claims on classic control tasks.
    RECKONING: Reasoning through Dynamic Knowledge Encoding. (arXiv:2305.06349v3 [cs.CL] UPDATED)
    Recent studies on transformer-based language models show that they can answer questions by reasoning over knowledge provided as part of the context (i.e., in-context reasoning). However, since the available knowledge is often not filtered for a particular question, in-context reasoning can be sensitive to distractor facts, additional content that is irrelevant to a question but that may be relevant for a different question (i.e., not necessarily random noise). In these situations, the model fails to distinguish the knowledge that is necessary to answer the question, leading to spurious reasoning and degraded performance. This reasoning failure contrasts with the model's apparent ability to distinguish its contextual knowledge from all the knowledge it has memorized during pre-training. Following this observation, we propose teaching the model to reason more robustly by folding the provided contextual knowledge into the model's parameters before presenting it with a question. Our method, RECKONING, is a bi-level learning algorithm that teaches language models to reason by updating their parametric knowledge through back-propagation, allowing them to then answer questions using the updated parameters. During training, the inner loop rapidly adapts a copy of the model weights to encode contextual knowledge into its parameters. In the outer loop, the model learns to use the updated weights to reproduce and answer reasoning questions about the memorized knowledge. Our experiments on two multi-hop reasoning datasets show that RECKONING's performance improves over the in-context reasoning baseline (by up to 4.5%). We also find that compared to in-context reasoning, RECKONING generalizes better to longer reasoning chains unseen during training, is more robust to distractors in the context, and is more computationally efficient when multiple questions are asked about the same knowledge.
    Disentangled Representation Learning with Large Language Models for Text-Attributed Graphs. (arXiv:2310.18152v2 [cs.CL] UPDATED)
    Text-attributed graphs (TAGs) are prevalent on the web and research over TAGs such as citation networks, e-commerce networks and social networks has attracted considerable attention in the web community. Recently, large language models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks. However, the existing works focus on harnessing the potential of LLMs solely relying on prompts to convey graph structure information to LLMs, thus suffering from insufficient understanding of the complex structural relationships within TAGs. To address this problem, in this paper we present the Disentangled Graph-Text Learner (DGTL) model, which is able to enhance the reasoning and predicting capabilities of LLMs for TAGs. Our proposed DGTL model incorporates graph structure information through tailored disentangled graph neural network (GNN) layers, enabling LLMs to capture the intricate relationships hidden in text-attributed graphs from multiple structural factors. Furthermore, DGTL operates with frozen pre-trained LLMs, reducing computational costs and allowing much more flexibility in combining with different LLM models. Experimental evaluations demonstrate the effectiveness of the proposed DGTL model on achieving superior or comparable performance over state-of-the-art baselines. Additionally, we also demonstrate that our DGTL model can offer natural language explanations for predictions, thereby significantly enhancing model interpretability.
    Crop Disease Classification using Support Vector Machines with Green Chromatic Coordinate (GCC) and Attention based feature extraction for IoT based Smart Agricultural Applications. (arXiv:2311.00429v2 [eess.IV] UPDATED)
    Crops hold paramount significance as they serve as the primary provider of energy, nutrition, and medicinal benefits for the human population. Plant diseases, however, can negatively affect leaves during agricultural cultivation, resulting in significant losses in crop output and economic value. Therefore, it is crucial for farmers to identify crop diseases. However, this method frequently necessitates hard work, a lot of planning, and in-depth familiarity with plant pathogens. Given these numerous obstacles, it is essential to provide solutions that can easily interface with mobile and IoT devices so that our farmers can guarantee the best possible crop development. Various machine learning (ML) as well as deep learning (DL) algorithms have been created & studied for the identification of plant disease detection, yielding substantial and promising results. This article presents a novel classification method that builds on prior work by utilising attention-based feature extraction, RGB channel-based chromatic analysis, Support Vector Machines (SVM) for improved performance, and the ability to integrate with mobile applications and IoT devices after quantization of information. Several disease classification algorithms were compared with the suggested model, and it was discovered that, in terms of accuracy, Vision Transformer-based feature extraction and additional Green Chromatic Coordinate feature with SVM classification achieved an accuracy of (GCCViT-SVM) - 99.69%, whereas after quantization for IoT device integration achieved an accuracy of - 97.41% while almost reducing 4x in size. Our findings have profound implications because they have the potential to transform how farmers identify crop illnesses with precise and fast information, thereby preserving agricultural output and ensuring food security.
    Bandit Social Learning: Exploration under Myopic Behavior. (arXiv:2302.07425v4 [cs.GT] UPDATED)
    We study social learning dynamics motivated by reviews on online platforms. The agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regards to exploration. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals for the arms' expected rewards. We derive stark learning failures for any such behavior, and provide matching positive results. As a special case, we obtain the first general results on failure of the greedy algorithm in bandits, thus providing a theoretical foundation for why bandit algorithms should explore.
    Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding. (arXiv:2303.12513v2 [cs.CV] UPDATED)
    Most humans use visual imagination to understand and reason about language, but models such as BERT reason about language using knowledge acquired during text-only pretraining. In this work, we investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning, focusing primarily on zero-shot probing methods. We propose a suite of visual language understanding (VLU) tasks for probing the visual reasoning abilities of text encoder models, as well as various non-visual natural language understanding (NLU) tasks for comparison. We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks without needing a prediction head such as the masked language modelling head of models like BERT. We show that SOTA multimodally trained text encoders outperform unimodally trained text encoders on the VLU tasks while being underperformed by them on the NLU tasks, lending new context to previously mixed results regarding the NLU capabilities of multimodal models. We conclude that exposure to images during pretraining affords inherent visual reasoning knowledge that is reflected in language-only tasks that require implicit visual reasoning. Our findings bear importance in the broader context of multimodal learning, providing principled guidelines for the choice of text encoders used in such contexts.
    An $\varepsilon$-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond. (arXiv:2305.16041v2 [stat.ML] UPDATED)
    We propose EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits. It is the first instance of Top Two algorithm analyzed for approximate best arm identification. EB-TC$\varepsilon$ is an *anytime* sampling rule that can therefore be employed without modification for fixed confidence or fixed budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TC$\varepsilon$. First, we prove bounds on its expected sample complexity in the fixed confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any error parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TC$\varepsilon$ performs favorably compared to existing algorithms, in different settings.
    NashFormer: Leveraging Local Nash Equilibria for Semantically Diverse Trajectory Prediction. (arXiv:2305.17600v2 [cs.LG] UPDATED)
    Interactions between road agents present a significant challenge in trajectory prediction, especially in cases involving multiple agents. Because existing diversity-aware predictors do not account for the interactive nature of multi-agent predictions, they may miss these important interaction outcomes. In this paper, we propose NashFormer, a framework for trajectory prediction that leverages game-theoretic inverse reinforcement learning to improve coverage of multi-modal predictions. We use a training-time game-theoretic analysis as an auxiliary loss resulting in improved coverage and accuracy without presuming a taxonomy of actions for the agents. We demonstrate our approach on the interactive split of the Waymo Open Motion Dataset, including four subsets involving scenarios with high interaction complexity. Experiment results show that our predictor produces accurate predictions while covering $33\%$ more potential interactions versus a baseline model.
    A physics-informed and attention-based graph learning approach for regional electric vehicle charging demand prediction. (arXiv:2309.05259v2 [cs.LG] UPDATED)
    Along with the proliferation of electric vehicles (EVs), optimizing the use of EV charging space can significantly alleviate the growing load on intelligent transportation systems. As the foundation to achieve such an optimization, a spatiotemporal method for EV charging demand prediction in urban areas is required. Although several solutions have been proposed by using data-driven deep learning methods, it can be found that these performance-oriented methods may suffer from misinterpretations to correctly handle the reverse relationship between charging demands and prices. To tackle the emerging challenges of training an accurate and interpretable prediction model, this paper proposes a novel approach that enables the integration of graph and temporal attention mechanisms for feature extraction and the usage of physic-informed meta-learning in the model pre-training step for knowledge transfer. Evaluation results on a dataset of 18,013 EV charging piles in Shenzhen, China, show that the proposed approach, named PAG, can achieve state-of-the-art forecasting performance and the ability in understanding the adaptive changes in charging demands caused by price fluctuations.
    Exact Generalization Guarantees for (Regularized) Wasserstein Distributionally Robust Models. (arXiv:2305.17076v2 [cs.LG] UPDATED)
    Wasserstein distributionally robust estimators have emerged as powerful models for prediction and decision-making under uncertainty. These estimators provide attractive generalization guarantees: the robust objective obtained from the training distribution is an exact upper bound on the true risk with high probability. However, existing guarantees either suffer from the curse of dimensionality, are restricted to specific settings, or lead to spurious error terms. In this paper, we show that these generalization guarantees actually hold on general classes of models, do not suffer from the curse of dimensionality, and can even cover distribution shifts at testing. We also prove that these results carry over to the newly-introduced regularized versions of Wasserstein distributionally robust problems.
    Sparse Modular Activation for Efficient Sequence Modeling. (arXiv:2306.11197v4 [cs.LG] UPDATED)
    Recent hybrid models combining Linear State Space Models (SSMs) with self-attention mechanisms have demonstrated impressive results across a range of sequence modeling tasks. However, current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs. To address this limitation, we introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely and dynamically activate sub-modules for sequence elements in a differentiable manner. Through allowing each element to skip non-activated sub-modules, SMA reduces computation and memory consumption of neural networks at both training and inference stages. To validate the effectiveness of SMA on sequence modeling, we design a novel neural architecture, SeqBoat, which employs SMA to sparsely activate a Gated Attention Unit (GAU) based on the state representations learned from an SSM. By constraining the GAU to only conduct local attention on the activated inputs, SeqBoat can achieve linear inference complexity with theoretically infinite attention span, and provide substantially better quality-efficiency trade-off than the chunking-based models. With experiments on a wide range of tasks, including long sequence modeling, speech classification and language modeling, SeqBoat brings new state-of-the-art results among hybrid models with linear complexity, and reveals the amount of attention needed for each task through the learned sparse activation patterns. Our code is publicly available at https://github.com/renll/SeqBoat.
    Optimizing Retrieval-augmented Reader Models via Token Elimination. (arXiv:2310.13682v2 [cs.CL] UPDATED)
    Fusion-in-Decoder (FiD) is an effective retrieval-augmented language model applied across a variety of open-domain tasks, such as question answering, fact checking, etc. In FiD, supporting passages are first retrieved and then processed using a generative model (Reader), which can cause a significant bottleneck in decoding time, particularly with long outputs. In this work, we analyze the contribution and necessity of all the retrieved passages to the performance of reader models, and propose eliminating some of the retrieved information, at the token level, that might not contribute essential information to the answer generation process. We demonstrate that our method can reduce run-time by up to 62.2%, with only a 2% reduction in performance, and in some cases, even improve the performance results.
    NODE-ImgNet: a PDE-informed effective and robust model for image denoising. (arXiv:2305.11049v2 [eess.IV] UPDATED)
    Inspired by the traditional partial differential equation (PDE) approach for image denoising, we propose a novel neural network architecture, referred as NODE-ImgNet, that combines neural ordinary differential equations (NODEs) with convolutional neural network (CNN) blocks. NODE-ImgNet is intrinsically a PDE model, where the dynamic system is learned implicitly without the explicit specification of the PDE. This naturally circumvents the typical issues associated with introducing artifacts during the learning process. By invoking such a NODE structure, which can also be viewed as a continuous variant of a residual network (ResNet) and inherits its advantage in image denoising, our model achieves enhanced accuracy and parameter efficiency. In particular, our model exhibits consistent effectiveness in different scenarios, including denoising gray and color images perturbed by Gaussian noise, as well as real-noisy images, and demonstrates superiority in learning from small image datasets.
    ProtoryNet - Interpretable Text Classification Via Prototype Trajectories. (arXiv:2007.01777v5 [cs.LG] UPDATED)
    We propose a novel interpretable deep neural network for text classification, called ProtoryNet, based on a new concept of prototype trajectories. Motivated by the prototype theory in modern linguistics, ProtoryNet makes a prediction by finding the most similar prototype for each sentence in a text sequence and feeding an RNN backbone with the proximity of each sentence to the corresponding active prototype. The RNN backbone then captures the temporal pattern of the prototypes, which we refer to as prototype trajectories. Prototype trajectories enable intuitive and fine-grained interpretation of the reasoning process of the RNN model, in resemblance to how humans analyze texts. We also design a prototype pruning procedure to reduce the total number of prototypes used by the model for better interpretability. Experiments on multiple public data sets show that ProtoryNet is more accurate than the baseline prototype-based deep neural net and reduces the performance gap compared to state-of-the-art black-box models. In addition, after prototype pruning, the resulting ProtoryNet models only need less than or around 20 prototypes for all datasets, which significantly benefits interpretability. Furthermore, we report a survey result indicating that human users find ProtoryNet more intuitive and easier to understand than other prototype-based methods.
    Detecting hidden confounding in observational data using multiple environments. (arXiv:2205.13935v4 [stat.ME] UPDATED)
    A common assumption in causal inference from observational data is that there is no hidden confounding. Yet it is, in general, impossible to verify this assumption from a single dataset. Under the assumption of independent causal mechanisms underlying the data-generating process, we demonstrate a way to detect unobserved confounders when having multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only absent when there is hidden confounding and examine cases where we violate its assumptions: degenerate & dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies and semi-synthetic data based on a real-world dataset. In most cases, the proposed procedure correctly predicts the presence of hidden confounding, particularly when the confounding bias is large.
    Mutual Information Regularized Offline Reinforcement Learning. (arXiv:2210.07484v2 [cs.LG] UPDATED)
    The major challenge of offline RL is the distribution shift that appears when out-of-distribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy or value for deviating from the behavior policy during policy improvement or evaluation. In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. Hence, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm simultaneously augments the policy evaluation and improvement by adding mutual information regularizations. MISA is a general framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. We introduce 3 different variants of MISA, and empirically demonstrate that tighter mutual information lower bound gives better offline RL performance. In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark,e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is available at https://github.com/sail-sg/MISA.
    Are you using test log-likelihood correctly?. (arXiv:2212.00219v3 [stat.ML] UPDATED)
    Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.
    Training Matters: Unlocking Potentials of Deeper Graph Convolutional Neural Networks. (arXiv:2008.08838v3 [cs.LG] UPDATED)
    The performance limit of Graph Convolutional Networks (GCNs) and the fact that we cannot stack more of them to increase the performance, which we usually do for other deep learning paradigms, are pervasively thought to be caused by the limitations of the GCN layers, including insufficient expressive power, etc. However, if so, for a fixed architecture, it would be unlikely to lower the training difficulty and to improve performance by changing only the training procedure, which we show in this paper not only possible but possible in several ways. This paper first identify the training difficulty of GCNs from the perspective of graph signal energy loss. More specifically, we find that the loss of energy in the backward pass during training nullifies the learning of the layers closer to the input. Then, we propose several methodologies to mitigate the training problem by slightly modifying the GCN operator, from the energy perspective. After empirical validation, we confirm that these changes of operator lead to significant decrease in the training difficulties and notable performance boost, without changing the composition of parameters. With these, we conclude that the root cause of the problem is more likely the training difficulty than the others.
    Monotone Learning. (arXiv:2202.05246v3 [cs.LG] UPDATED)
    The amount of training-data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training-data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Gyorfi, and Lugosi (1996) ask whether there exists a {monotone} Bayes-consistent algorithm. This question remained open for over 25 years, until recently Pestov (2021) resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm. We derive a general result in multiclass classification, showing that every learning algorithm A can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses a black-box oracle access to A. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye et al (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and by Mhammedi (2021). Our transformation readily implies monotone learners in a variety of contexts: for example it extends Pestov's result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov's work which is tailored to binary classification. In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. This resolves questions by Viering, Mey, and Loog (2019); Viering and Loog (2021); Mhammedi (2021).
    A Contrastive Approach to Online Change Point Detection. (arXiv:2206.10143v3 [stat.ML] UPDATED)
    We suggest a novel procedure for online change point detection. Our approach expands an idea of maximizing a discrepancy measure between points from pre-change and post-change distributions. This leads to a flexible procedure suitable for both parametric and nonparametric scenarios. We prove non-asymptotic bounds on the average running length of the procedure and its expected detection delay. The efficiency of the algorithm is illustrated with numerical experiments on synthetic and real-world data sets.
    Multi-scale data reconstruction of turbulent rotating flows with Gappy POD, Extended POD and Generative Adversarial Networks. (arXiv:2210.11921v2 [physics.flu-dyn] UPDATED)
    Data reconstruction of rotating turbulent snapshots is investigated utilizing data-driven tools. This problem is crucial for numerous geophysical applications and fundamental aspects, given the concurrent effects of direct and inverse energy cascades, which lead to non-Gaussian statistics at both large and small scales. Data assimilation also serves as a tool to rank physical features within turbulence, by evaluating the performance of reconstruction in terms of the quality and quantity of the information used. Additionally, benchmarking various reconstruction techniques is essential to assess the trade-off between quantitative supremacy, implementation complexity, and explicability. In this study, we use linear and non-linear tools based on the Proper Orthogonal Decomposition (POD) and Generative Adversarial Network (GAN) for reconstructing rotating turbulence snapshots with spatial damages (inpainting). We focus on accurately reproducing both statistical properties and instantaneous velocity fields. Different gap sizes and gap geometries are investigated in order to assess the importance of coherency and multi-scale properties of the missing information. Surprisingly enough, concerning point-wise reconstruction, the non-linear GAN does not outperform one of the linear POD techniques. On the other hand, supremacy of the GAN approach is shown when the statistical multi-scale properties are compared. Similarly, extreme events in the gap region are better predicted when using GAN. The balance between point-wise error and statistical properties is controlled by the adversarial ratio, which determines the relative importance of the generator and the discriminator in the GAN training. Robustness against the measurement noise is also discussed.
    When Do We Need Graph Neural Networks for Node Classification?. (arXiv:2210.16979v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) extend basic Neural Networks (NNs) by additionally making use of graph structure based on the relational inductive bias (edge bias), rather than treating the nodes as collections of independent and identically distributed (i.i.d.) samples. Though GNNs are believed to outperform basic NNs in real-world tasks, it is found that in some cases, GNNs have little performance gain or even underperform graph-agnostic NNs. To identify these cases, based on graph signal processing and statistical hypothesis testing, we propose two measures which analyze the cases in which the edge bias in features and labels does not provide advantages. Based on the measures, a threshold value can be given to predict the potential performance advantages of graph-aware models over graph-agnostic models.
    Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes. (arXiv:2212.06132v3 [cs.LG] UPDATED)
    We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs) whose transition probability can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with a carefully designed weight, which depends on a new variance estimator that (1) directly estimates the variance of the optimal value function, (2) monotonically decreases with respect to the number of episodes to ensure a better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator to control the complexity of the estimated value function class. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
    Comparative Knowledge Distillation. (arXiv:2311.02253v1 [cs.LG])
    In the era of large scale pretrained models, Knowledge Distillation (KD) serves an important role in transferring the wisdom of computationally heavy teacher models to lightweight, efficient student models while preserving performance. Traditional KD paradigms, however, assume readily available access to teacher models for frequent inference -- a notion increasingly at odds with the realities of costly, often proprietary, large scale models. Addressing this gap, our paper considers how to minimize the dependency on teacher model inferences in KD in a setting we term Few Teacher Inference Knowledge Distillation (FTI KD). We observe that prevalent KD techniques and state of the art data augmentation strategies fall short in this constrained setting. Drawing inspiration from educational principles that emphasize learning through comparison, we propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples. Critically, CKD provides additional learning signals to the student without making additional teacher calls. We also extend the principle of CKD to groups of samples, enabling even more efficient learning from limited teacher calls. Empirical evaluation across varied experimental settings indicates that CKD consistently outperforms state of the art data augmentation and KD techniques.
    Approximating CKY with Transformers. (arXiv:2305.02386v2 [cs.CL] UPDATED)
    We investigate the ability of transformer models to approximate the CKY algorithm, using them to directly predict a sentence's parse and thus avoid the CKY algorithm's cubic dependence on sentence length. We find that on standard constituency parsing benchmarks this approach achieves competitive or better performance than comparable parsers that make use of CKY, while being faster. We also evaluate the viability of this approach for parsing under \textit{random} PCFGs. Here we find that performance declines as the grammar becomes more ambiguous, suggesting that the transformer is not fully capturing the CKY computation. However, we also find that incorporating additional inductive bias is helpful, and we propose a novel approach that makes use of gradients with respect to chart representations in predicting the parse, in analogy with the CKY algorithm being a subgradient of a partition function variant with respect to the chart.
    Recommender Systems with Generative Retrieval. (arXiv:2305.05065v3 [cs.IR] UPDATED)
    Modern recommender systems perform large-scale retrieval by first embedding queries and item candidates in the same unified space, followed by approximate nearest neighbor search to select top candidates given a query embedding. In this paper, we propose a novel generative retrieval approach, where the retrieval model autoregressively decodes the identifiers of the target candidates. To that end, we create semantically meaningful tuple of codewords to serve as a Semantic ID for each item. Given Semantic IDs for items in a user session, a Transformer-based sequence-to-sequence model is trained to predict the Semantic ID of the next item that the user will interact with. To the best of our knowledge, this is the first Semantic ID-based generative model for recommendation tasks. We show that recommender systems trained with the proposed paradigm significantly outperform the current SOTA models on various datasets. In addition, we show that incorporating Semantic IDs into the sequence-to-sequence model enhances its ability to generalize, as evidenced by the improved retrieval performance observed for items with no prior interaction history.
    Regularized Linear Regression for Binary Classification. (arXiv:2311.02270v1 [cs.LG])
    Regularized linear regression is a promising approach for binary classification problems in which the training set has noisy labels since the regularization term can help to avoid interpolating the mislabeled data points. In this paper we provide a systematic study of the effects of the regularization strength on the performance of linear classifiers that are trained to solve binary classification problems by minimizing a regularized least-squares objective. We consider the over-parametrized regime and assume that the classes are generated from a Gaussian Mixture Model (GMM) where a fraction $c<\frac{1}{2}$ of the training data is mislabeled. Under these assumptions, we rigorously analyze the classification errors resulting from the application of ridge, $\ell_1$, and $\ell_\infty$ regression. In particular, we demonstrate that ridge regression invariably improves the classification error. We prove that $\ell_1$ regularization induces sparsity and observe that in many cases one can sparsify the solution by up to two orders of magnitude without any considerable loss of performance, even though the GMM has no underlying sparsity structure. For $\ell_\infty$ regularization we show that, for large enough regularization strength, the optimal weights concentrate around two values of opposite sign. We observe that in many cases the corresponding "compression" of each weight to a single bit leads to very little loss in performance. These latter observations can have significant practical ramifications.
    Detection and Localization of Melanoma Skin Cancer in Histopathological Whole Slide Images. (arXiv:2302.03014v4 [eess.IV] UPDATED)
    Melanoma diagnosed and treated in its early stages can increase the survival rate. A projected increase in skin cancer incidents and a dearth of dermatopathologists have emphasized the need for computational pathology (CPATH) systems. CPATH systems with deep learning (DL) models have the potential to identify the presence of melanoma by exploiting underlying morphological and cellular features. This paper proposes a DL method to detect melanoma and distinguish between normal skin and benign/malignant melanocytic lesions in Whole Slide Images (WSI). Our method detects lesions with high accuracy and localizes them on a WSI to identify potential regions of interest for pathologists. Interestingly, our DL method relies on using a single CNN network to create localization maps first and use them to perform slide-level predictions to determine patients who have melanoma. Our best model provides favorable patch-wise classification results with a 0.992 F1 score and 0.99 sensitivity on unseen data. The source code is https://github.com/RogerAmundsen/Melanoma-Diagnosis-and-Localization-from-Whole-Slide-Images-using-Convolutional-Neural-Networks.
    Stochastic Interpolants: A Unifying Framework for Flows and Diffusions. (arXiv:2303.08797v3 [cs.LG] UPDATED)
    A class of generative models that unifies flow-based and diffusion-based methods is introduced. These models extend the framework proposed in Albergo & Vanden-Eijnden (2023), enabling the use of a broad class of continuous-time stochastic processes called `stochastic interpolants' to bridge any two arbitrary probability density functions exactly in finite time. These interpolants are built by combining data from the two prescribed densities with an additional latent variable that shapes the bridge in a flexible way. The time-dependent probability density function of the stochastic interpolant is shown to satisfy a first-order transport equation as well as a family of forward and backward Fokker-Planck equations with tunable diffusion coefficient. Upon consideration of the time evolution of an individual sample, this viewpoint immediately leads to both deterministic and stochastic generative models based on probability flow equations or stochastic differential equations with an adjustable level of noise. The drift coefficients entering these models are time-dependent velocity fields characterized as the unique minimizers of simple quadratic objective functions, one of which is a new objective for the score of the interpolant density. We show that minimization of these quadratic objectives leads to control of the likelihood for generative models built upon stochastic dynamics, while likelihood control for deterministic dynamics is more stringent. We also discuss connections with other methods such as score-based diffusion models, stochastic localization processes, probabilistic denoising techniques, and rectifying flows. In addition, we demonstrate that stochastic interpolants recover the Schr\"odinger bridge between the two target densities when explicitly optimizing over the interpolant. Finally, algorithmic aspects are discussed and the approach is illustrated on numerical examples.
    New Insights into Graph Convolutional Networks using Neural Tangent Kernels. (arXiv:2110.04060v2 [cs.LG] UPDATED)
    Graph Convolutional Networks (GCNs) have emerged as powerful tools for learning on network structured data. Although empirically successful, GCNs exhibit certain behaviour that has no rigorous explanation -- for instance, the performance of GCNs significantly degrades with increasing network depth, whereas it improves marginally with depth using skip connections. This paper focuses on semi-supervised learning on graphs, and explains the above observations through the lens of Neural Tangent Kernels (NTKs). We derive NTKs corresponding to infinitely wide GCNs (with and without skip connections). Subsequently, we use the derived NTKs to identify that, with suitable normalisation, network depth does not always drastically reduce the performance of GCNs -- a fact that we also validate through extensive simulation. Furthermore, we propose NTK as an efficient `surrogate model' for GCNs that does not suffer from performance fluctuations due to hyper-parameter tuning since it is a hyper-parameter free deterministic kernel. The efficacy of this idea is demonstrated through a comparison of different skip connections for GCNs using the surrogate NTKs.
    The Exact Sample Complexity Gain from Invariances for Kernel Regression. (arXiv:2303.14269v2 [cs.LG] UPDATED)
    In practice, encoding invariances into models improves sample complexity. In this work, we study this phenomenon from a theoretical perspective. In particular, we provide minimax optimal rates for kernel ridge regression on compact manifolds, with a target function that is invariant to a group action on the manifold. Our results hold for any smooth compact Lie group action, even groups of positive dimension. For a finite group, the gain effectively multiplies the number of samples by the group size. For groups of positive dimension, the gain is observed by a reduction in the manifold's dimension, in addition to a factor proportional to the volume of the quotient space. Our proof takes the viewpoint of differential geometry, in contrast to the more common strategy of using invariant polynomials. This new geometric viewpoint on learning with invariances may be of independent interest.
    Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation. (arXiv:2310.18919v2 [cs.LG] UPDATED)
    Recent studies in reinforcement learning (RL) have made significant progress by leveraging function approximation to alleviate the sample complexity hurdle for better performance. Despite the success, existing provably efficient algorithms typically rely on the accessibility of immediate feedback upon taking actions. The failure to account for the impact of delay in observations can significantly degrade the performance of real-world systems due to the regret blow-up. In this work, we tackle the challenge of delayed feedback in RL with linear function approximation by employing posterior sampling, which has been shown to empirically outperform the popular UCB algorithms in a wide range of regimes. We first introduce Delayed-PSVI, an optimistic value-based algorithm that effectively explores the value function space via noise perturbation with posterior sampling. We provide the first analysis for posterior sampling algorithms with delayed feedback in RL and show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[\tau])$ worst-case regret in the presence of unknown stochastic delays. Here $E[\tau]$ is the expected delay. To further improve its computational efficiency and to expand its applicability in high-dimensional RL problems, we incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI, which maintains the same order-optimal regret guarantee with $\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to demonstrate the statistical and computational efficacy of our algorithms.
    Using DUCK-Net for Polyp Image Segmentation. (arXiv:2311.02239v1 [cs.CV])
    This paper presents a novel supervised convolutional neural network architecture, "DUCK-Net", capable of effectively learning and generalizing from small amounts of medical images to perform accurate segmentation tasks. Our model utilizes an encoder-decoder structure with a residual downsampling mechanism and a custom convolutional block to capture and process image information at multiple resolutions in the encoder segment. We employ data augmentation techniques to enrich the training set, thus increasing our model's performance. While our architecture is versatile and applicable to various segmentation tasks, in this study, we demonstrate its capabilities specifically for polyp segmentation in colonoscopy images. We evaluate the performance of our method on several popular benchmark datasets for polyp segmentation, Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, and ETIS-LARIBPOLYPDB showing that it achieves state-of-the-art results in terms of mean Dice coefficient, Jaccard index, Precision, Recall, and Accuracy. Our approach demonstrates strong generalization capabilities, achieving excellent performance even with limited training data. The code is publicly available on GitHub: https://github.com/RazvanDu/DUCK-Net
    Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization. (arXiv:2202.05423v3 [cs.LG] UPDATED)
    Over the recent years, reinforcement learning (RL) starts to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, theoretical study on why RL helps is still at its early stage. This paper presents the first systematic study on policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Best Choice Problem (BCP), we formally prove that distribution shift is reduced exponentially with curriculum learning even if the curriculum is a randomly generated BCP on a smaller scale. Our theory also shows we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on the Best Choice Problem, Online Knapsack, and AdWords to verify our findings.
    Robust Fine-Tuning of Vision-Language Models for Domain Generalization. (arXiv:2311.02236v1 [cs.CV])
    Transfer learning enables the sharing of common knowledge among models for a variety of downstream tasks, but traditional methods suffer in limited training data settings and produce narrow models incapable of effectively generalizing under distribution shifts. Foundation models have recently demonstrated impressive zero-shot inference capabilities and robustness under distribution shifts. However, zero-shot evaluation for these models has been predominantly confined to benchmarks with simple distribution shifts, limiting our understanding of their effectiveness under the more realistic shifts found in practice. Moreover, common fine-tuning methods for these models have yet to be evaluated against vision models in few-shot scenarios where training data is limited. To address these gaps, we present a new recipe for few-shot fine-tuning of the popular vision-language foundation model CLIP and evaluate its performance on challenging benchmark datasets with realistic distribution shifts from the WILDS collection. Our experimentation demonstrates that, while zero-shot CLIP fails to match performance of trained vision models on more complex benchmarks, few-shot CLIP fine-tuning outperforms its vision-only counterparts in terms of in-distribution and out-of-distribution accuracy at all levels of training data availability. This provides a strong incentive for adoption of foundation models within few-shot learning applications operating with real-world data. Code is available at https://github.com/mit-ll/robust-vision-language-finetuning
    Predicting Ground Reaction Force from Inertial Sensors. (arXiv:2311.02287v1 [cs.LG])
    The study of ground reaction forces (GRF) is used to characterize the mechanical loading experienced by individuals in movements such as running, which is clinically applicable to identify athletes at risk for stress-related injuries. Our aim in this paper is to determine if data collected with inertial measurement units (IMUs), that can be worn by athletes during outdoor runs, can be used to predict GRF with sufficient accuracy to allow the analysis of its derived biomechanical variables (e.g., contact time and loading rate). In this paper, we consider lightweight approaches in contrast to state-of-the-art prediction using LSTM neural networks. Specifically, we compare use of LSTMs to k-Nearest Neighbors (KNN) regression as well as propose a novel solution, SVD Embedding Regression (SER), using linear regression between singular value decomposition embeddings of IMUs data (input) and GRF data (output). We evaluate the accuracy of these techniques when using training data collected from different athletes, from the same athlete, or both, and we explore the use of acceleration and angular velocity data from sensors at different locations (sacrum and shanks). Our results illustrate that simple machine learning methods such as SER and KNN can be similarly accurate or more accurate than LSTM neural networks, with much faster training times and hyperparameter optimization; in particular, SER and KNN are more accurate when personal training data are available, and KNN comes with benefit of providing provenance of prediction. Notably, the use of personal data reduces prediction errors of all methods for most biomechanical variables.
    Rethinking Symmetric Matrix Factorization: A More General and Better Clustering Perspective. (arXiv:2209.02528v3 [cs.LG] UPDATED)
    Nonnegative matrix factorization (NMF) is widely used for clustering with strong interpretability. Among general NMF problems, symmetric NMF is a special one that plays an important role in graph clustering where each element measures the similarity between data points. Most existing symmetric NMF algorithms require factor matrices to be nonnegative, and only focus on minimizing the gap between similarity matrix and its approximation for clustering, without giving a consideration to other potential regularization terms which can yield better clustering. In this paper, we explore factorizing a symmetric matrix that does not have to be nonnegative, presenting an efficient factorization algorithm with a regularization term to boost the clustering performance. Moreover, a more general framework is proposed to solve symmetric matrix factorization problems with different constraints on the factor matrices.
    State-wise Safe Reinforcement Learning With Pixel Observations. (arXiv:2311.02227v1 [cs.LG])
    Reinforcement Learning(RL) in the context of safe exploration has long grappled with the challenges of the delicate balance between maximizing rewards and minimizing safety violations, the complexities arising from contact-rich or non-smooth environments, and high-dimensional pixel observations. Furthermore, incorporating state-wise safety constraints in the exploration and learning process, where the agent is prohibited from accessing unsafe regions without prior knowledge, adds an additional layer of complexity. In this paper, we propose a novel pixel-observation safe RL algorithm that efficiently encodes state-wise safety constraints with unknown hazard regions through the introduction of a latent barrier function learning mechanism. As a joint learning framework, our approach first involves constructing a latent dynamics model with low-dimensional latent spaces derived from pixel observations. Subsequently, we build and learn a latent barrier function on top of the latent dynamics and conduct policy optimization simultaneously, thereby improving both safety and the total expected return. Experimental evaluations on the safety-gym benchmark suite demonstrate that our proposed method significantly reduces safety violations throughout the training process and demonstrates faster safety convergence compared to existing methods while achieving competitive results in reward return.
    Gray Learning from Non-IID Data with Out-of-distribution Samples. (arXiv:2206.09375v2 [cs.LG] UPDATED)
    The integrity of training data, even when annotated by experts, is far from guaranteed, especially for non-IID datasets comprising both in- and out-of-distribution samples. In an ideal scenario, the majority of samples would be in-distribution, while samples that deviate semantically would be identified as out-of-distribution and excluded during the annotation process. However, experts may erroneously classify these out-of-distribution samples as in-distribution, assigning them labels that are inherently unreliable. This mixture of unreliable labels and varied data types makes the task of learning robust neural networks notably challenging. We observe that both in- and out-of-distribution samples can almost invariably be ruled out from belonging to certain classes, aside from those corresponding to unreliable ground-truth labels. This opens the possibility of utilizing reliable complementary labels that indicate the classes to which a sample does not belong. Guided by this insight, we introduce a novel approach, termed \textit{Gray Learning} (GL), which leverages both ground-truth and complementary labels. Crucially, GL adaptively adjusts the loss weights for these two label types based on prediction confidence levels. By grounding our approach in statistical learning theory, we derive bounds for the generalization error, demonstrating that GL achieves tight constraints even in non-IID settings. Extensive experimental evaluations reveal that our method significantly outperforms alternative approaches grounded in robust statistics.
    Benign Overfitting for Two-layer ReLU Convolutional Neural Networks. (arXiv:2303.04145v2 [cs.LG] UPDATED)
    Modern deep learning models with great expressive power can be trained to overfit the training data but still generalize well. This phenomenon is referred to as \textit{benign overfitting}. Recently, a few studies have attempted to theoretically understand benign overfitting in neural networks. However, these works are either limited to neural networks with smooth activation functions or to the neural tangent kernel regime. How and when benign overfitting can occur in ReLU neural networks remains an open problem. In this work, we seek to answer this question by establishing algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise. We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk. Our result also reveals a sharp transition between benign and harmful overfitting under different conditions on data distribution in terms of test risk. Experiments on synthetic data back up our theory.
    Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs. (arXiv:2311.02262v1 [cs.CL])
    In human-written articles, we often leverage the subtleties of text style, such as bold and italics, to guide the attention of readers. These textual emphases are vital for the readers to grasp the conveyed information. When interacting with large language models (LLMs), we have a similar need - steering the model to pay closer attention to user-specified information, e.g., an instruction. Existing methods, however, are constrained to process plain text and do not support such a mechanism. This motivates us to introduce PASTA - Post-hoc Attention STeering Approach, a method that allows LLMs to read text with user-specified emphasis marks. To this end, PASTA identifies a small subset of attention heads and applies precise attention reweighting on them, directing the model attention to user-specified parts. Like prompting, PASTA is applied at inference time and does not require changing any model parameters. Experiments demonstrate that PASTA can substantially enhance an LLM's ability to follow user instructions or integrate new knowledge from user inputs, leading to a significant performance improvement on a variety of tasks, e.g., an average accuracy improvement of 22% for LLAMA-7B. Our code is publicly available at https://github.com/QingruZhang/PASTA .
    Distraction is All You Need for Fairness. (arXiv:2203.07593v3 [cs.LG] UPDATED)
    Bias in training datasets must be managed for various groups in classification tasks to ensure parity or equal treatment. With the recent growth in artificial intelligence models and their expanding role in automated decision-making, ensuring that these models are not biased is vital. There is an abundance of evidence suggesting that these models could contain or even amplify the bias present in the data on which they are trained, inherent to their objective function and learning algorithms; Many researchers direct their attention to this issue in different directions, namely, changing data to be statistically independent, adversarial training for restricting the capabilities of a particular competitor who aims to maximize parity, etc. These methods result in information loss and do not provide a suitable balance between accuracy and fairness or do not ensure limiting the biases in training. To this end, we propose a powerful strategy for training deep learning models called the Distraction module, which can be theoretically proven effective in controlling bias from affecting the classification results. This method can be utilized with different data types (e.g., Tabular, images, graphs, etc.). We demonstrate the potency of the proposed method by testing it on UCI Adult and Heritage Health datasets (tabular), POKEC-Z, POKEC-N and NBA datasets (graph), and CelebA dataset (vision). Using state-of-the-art methods proposed in the fairness literature for each dataset, we exhibit our model is superior to these proposed methods in minimizing bias and maintaining accuracy.
    The Potential of Wearable Sensors for Assessing Patient Acuity in Intensive Care Unit (ICU). (arXiv:2311.02251v1 [cs.LG])
    Acuity assessments are vital in critical care settings to provide timely interventions and fair resource allocation. Traditional acuity scores rely on manual assessments and documentation of physiological states, which can be time-consuming, intermittent, and difficult to use for healthcare providers. Furthermore, such scores do not incorporate granular information such as patients' mobility level, which can indicate recovery or deterioration in the ICU. We hypothesized that existing acuity scores could be potentially improved by employing Artificial Intelligence (AI) techniques in conjunction with Electronic Health Records (EHR) and wearable sensor data. In this study, we evaluated the impact of integrating mobility data collected from wrist-worn accelerometers with clinical data obtained from EHR for developing an AI-driven acuity assessment score. Accelerometry data were collected from 86 patients wearing accelerometers on their wrists in an academic hospital setting. The data was analyzed using five deep neural network models: VGG, ResNet, MobileNet, SqueezeNet, and a custom Transformer network. These models outperformed a rule-based clinical score (SOFA= Sequential Organ Failure Assessment) used as a baseline, particularly regarding the precision, sensitivity, and F1 score. The results showed that while a model relying solely on accelerometer data achieved limited performance (AUC 0.50, Precision 0.61, and F1-score 0.68), including demographic information with the accelerometer data led to a notable enhancement in performance (AUC 0.69, Precision 0.75, and F1-score 0.67). This work shows that the combination of mobility and patient information can successfully differentiate between stable and unstable states in critically ill patients.
    A New Bandit Setting Balancing Information from State Evolution and Corrupted Context. (arXiv:2011.07989v4 [cs.LG] UPDATED)
    We propose a new sequential decision-making setting, combining key aspects of two established online learning problems with bandit feedback. The optimal action to play at any given moment is contingent on an underlying changing state which is not directly observable by the agent. Each state is associated with a context distribution, possibly corrupted, allowing the agent to identify the state. Furthermore, states evolve in a Markovian fashion, providing useful information to estimate the current state via state history. In the proposed problem setting, we tackle the challenge of deciding on which of the two sources of information the agent should base its arm selection. We present an algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit. We capture the time-correlation of states through iteratively learning the action-reward transition model, allowing for efficient exploration of actions. Our setting is motivated by adaptive mobile health (mHealth) interventions. Users transition through different, time-correlated, but only partially observable internal states, determining their current needs. The side information associated with each internal state might not always be reliable, and standard approaches solely rely on the context risk of incurring high regret. Similarly, some users might exhibit weaker correlations between subsequent states, leading to approaches that solely rely on state transitions risking the same. We analyze our setting and algorithm in terms of regret lower bound and upper bounds and evaluate our method on simulated medication adherence intervention data and several real-world data sets, showing improved empirical performance compared to several popular algorithms.
    Contrastive Multi-Modal Representation Learning for Spark Plug Fault Diagnosis. (arXiv:2311.02282v1 [cs.LG])
    Due to the incapability of one sensory measurement to provide enough information for condition monitoring of some complex engineered industrial mechanisms and also for overcoming the misleading noise of a single sensor, multiple sensors are installed to improve the condition monitoring of some industrial equipment. Therefore, an efficient data fusion strategy is demanded. In this research, we presented a Denoising Multi-Modal Autoencoder with a unique training strategy based on contrastive learning paradigm, both being utilized for the first time in the machine health monitoring realm. The presented approach, which leverages the merits of both supervised and unsupervised learning, not only achieves excellent performance in fusing multiple modalities (or views) of data into an enriched common representation but also takes data fusion to the next level wherein one of the views can be omitted during inference time with very slight performance reduction, or even without any reduction at all. The presented methodology enables multi-modal fault diagnosis systems to perform more robustly in case of sensor failure occurrence, and one can also intentionally omit one of the sensors (the more expensive one) in order to build a more cost-effective condition monitoring system without sacrificing performance for practical purposes. The effectiveness of the presented methodology is examined on a real-world private multi-modal dataset gathered under non-laboratory conditions from a complex engineered mechanism, an inline four-stroke spark-ignition engine, aiming for spark plug fault diagnosis. This dataset, which contains the accelerometer and acoustic signals as two modalities, has a very slight amount of fault, and achieving good performance on such a dataset promises that the presented method can perform well on other equipment as well.
    One-shot Imitation Learning via Interaction Warping. (arXiv:2306.12392v2 [cs.RO] UPDATED)
    Imitation learning of robot policies from few demonstrations is crucial in open-ended applications. We propose a new method, Interaction Warping, for learning SE(3) robotic manipulation policies from a single demonstration. We infer the 3D mesh of each object in the environment using shape warping, a technique for aligning point clouds across object instances. Then, we represent manipulation actions as keypoints on objects, which can be warped with the shape of the object. We show successful one-shot imitation learning on three simulated and real-world object re-arrangement tasks. We also demonstrate the ability of our method to predict object meshes and robot grasps in the wild.  ( 2 min )
    Contrast Everything: A Hierarchical Contrastive Framework for Medical Time-Series. (arXiv:2310.14017v4 [cs.LG] UPDATED)
    Contrastive representation learning is crucial in medical time series analysis as it alleviates dependency on labor-intensive, domain-specific, and scarce expert annotations. However, existing contrastive learning methods primarily focus on one single data level, which fails to fully exploit the intricate nature of medical time series. To address this issue, we present COMET, an innovative hierarchical framework that leverages data consistencies at all inherent levels in medical time series. Our meticulously designed model systematically captures data consistency from four potential levels: observation, sample, trial, and patient levels. By developing contrastive loss at multiple levels, we can learn effective representations that preserve comprehensive data consistency, maximizing information utilization in a self-supervised manner. We conduct experiments in the challenging patient-independent setting. We compare COMET against six baselines using three diverse datasets, which include ECG signals for myocardial infarction and EEG signals for Alzheimer's and Parkinson's diseases. The results demonstrate that COMET consistently outperforms all baselines, particularly in setup with 10% and 1% labeled data fractions across all datasets. These results underscore the significant impact of our framework in advancing contrastive representation learning techniques for medical time series. The source code is available at https://github.com/DL4mHealth/COMET.
    Robust representations of oil wells' intervals via sparse attention mechanism. (arXiv:2212.14246v3 [cs.LG] UPDATED)
    Transformer-based neural network architectures achieve state-of-the-art results in different domains, from natural language processing (NLP) to computer vision (CV). The key idea of Transformers, the attention mechanism, has already led to significant breakthroughs in many areas. The attention has found their implementation for time series data as well. However, due to the quadratic complexity of the attention calculation regarding input sequence length, the application of Transformers is limited by high resource demands. Moreover, their modifications for industrial time series need to be robust to missing or noised values, which complicates the expansion of the horizon of their application. To cope with these issues, we introduce the class of efficient Transformers named Regularized Transformers (Reguformers). We implement the regularization technique inspired by the dropout ideas to improve robustness and reduce computational expenses. The focus in our experiments is on oil&gas data, namely, well logs, a prominent example of multivariate time series. The goal is to solve the problems of similarity and representation learning for them. To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells. The experiments show that all variations of Reguformers outperform the previously developed RNNs, classical Transformer model, and robust modifications of it like Informer and Performer in terms of well-intervals' classification and the quality of the obtained well-intervals' representations. Moreover, the sustainability to missing and incorrect data in our models exceeds that of others by a significant margin. The best result that the Reguformer achieves on well-interval similarity task is the mean PR~AUC score equal to 0.983, which is comparable to the classical Transformer and outperforms the previous models.
    ZRG: A Dataset for Multimodal 3D Residential Rooftop Understanding. (arXiv:2304.13219v2 [cs.CV] UPDATED)
    A crucial part of any home is the roof over our heads to protect us from the elements. In this paper we present the Zeitview Rooftop Geometry (ZRG) dataset for residential rooftop understanding. ZRG is a large-scale residential rooftop dataset of over 20k properties collected through roof inspections from across the U.S. and contains multiple modalities including high resolution aerial orthomosaics, digital surface models (DSM), colored point clouds, and 3D roof wireframe annotations. We provide an in-depth analysis and perform several experimental baselines including roof outline extraction, monocular height estimation, and planar roof structure extraction, to illustrate a few of the numerous potential applications unlocked by this dataset.
    Imitation Bootstrapped Reinforcement Learning. (arXiv:2311.02198v1 [cs.LG])
    Despite the considerable potential of reinforcement learning (RL), robotics control tasks predominantly rely on imitation learning (IL) owing to its better sample efficiency. However, given the high cost of collecting extensive demonstrations, RL is still appealing if it can utilize limited imitation data for efficient autonomous self-improvement. Existing RL methods that utilize demonstrations either initialize the replay buffer with demonstrations and oversample them during RL training, which does not benefit from the generalization potential of modern IL methods, or pretrain the RL policy with IL on the demonstrations, which requires additional mechanisms to prevent catastrophic forgetting during RL fine-tuning. We propose imitation bootstrapped reinforcement learning (IBRL), a novel framework that first trains an IL policy on a limited number of demonstrations and then uses it to propose alternative actions for both online exploration and target value bootstrapping. IBRL achieves SoTA performance and sample efficiency on 7 challenging sparse reward continuous control tasks in simulation while learning directly from pixels. As a highlight of our method, IBRL achieves $6.4\times$ higher success rate than RLPD, a strong method that combines the idea of oversampling demonstrations with modern RL improvements, under the budget of 10 demos and 100K interactions in the challenging PickPlaceCan task in the Robomimic benchmark.
    Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. (arXiv:2210.07893v3 [stat.ML] UPDATED)
    Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks.
    A Closer Look at Reward Decomposition for High-Level Robotic Explanations. (arXiv:2304.12958v2 [cs.LG] UPDATED)
    Explaining the behaviour of intelligent agents learned by reinforcement learning (RL) to humans is challenging yet crucial due to their incomprehensible proprioceptive states, variational intermediate goals, and resultant unpredictability. Moreover, one-step explanations for RL agents can be ambiguous as they fail to account for the agent's future behaviour at each transition, adding to the complexity of explaining robot actions. By leveraging abstracted actions that map to task-specific primitives, we avoid explanations on the movement level. To further improve the transparency and explainability of robotic systems, we propose an explainable Q-Map learning framework that combines reward decomposition (RD) with abstracted action spaces, allowing for non-ambiguous and high-level explanations based on object properties in the task. We demonstrate the effectiveness of our framework through quantitative and qualitative analysis of two robotic scenarios, showcasing visual and textual explanations, from output artefacts of RD explanations, that are easy for humans to comprehend. Additionally, we demonstrate the versatility of integrating these artefacts with large language models (LLMs) for reasoning and interactive querying.
    A unified recipe for deriving (time-uniform) PAC-Bayes bounds. (arXiv:2302.03421v4 [stat.ML] UPDATED)
    We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. Our main result is a PAC-Bayes theorem which holds for a wide class of discrete stochastic processes. We show how this result implies time-uniform versions of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds. Our framework also enables us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.
    Improving Code Example Recommendations on Informal Documentation Using BERT and Query-Aware LSH: A Comparative Study. (arXiv:2305.03017v4 [cs.SE] UPDATED)
    Our research investigates the recommendation of code examples to aid software developers, a practice that saves developers significant time by providing ready-to-use code snippets. The focus of our study is Stack Overflow, a commonly used resource for coding discussions and solutions, particularly in the context of the Java programming language. We applied BERT, a powerful Large Language Model (LLM) that enables us to transform code examples into numerical vectors by extracting their semantic information. Once these numerical representations are prepared, we identify Approximate Nearest Neighbors (ANN) using Locality-Sensitive Hashing (LSH). Our research employed two variants of LSH: Random Hyperplane-based LSH and Query-Aware LSH. We rigorously compared these two approaches across four parameters: HitRate, Mean Reciprocal Rank (MRR), Average Execution Time, and Relevance. Our study revealed that the Query-Aware (QA) approach showed superior performance over the Random Hyperplane-based (RH) method. Specifically, it exhibited a notable improvement of 20\% to 35\% in HitRate for query pairs compared to the RH approach. Furthermore, the QA approach proved significantly more time-efficient, with its speed in creating hashing tables and assigning data samples to buckets being at least four times faster. It can return code examples within milliseconds, whereas the RH approach typically requires several seconds to recommend code examples. Due to the superior performance of the QA approach, we tested it against PostFinder and FaCoY, the state-of-the-art baselines. Our QA method showed comparable efficiency proving its potential for effective code recommendation.
    AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation. (arXiv:2301.08110v5 [cs.LG] UPDATED)
    Generative transformer models have become increasingly complex, with large numbers of parameters and the ability to process multiple input modalities. Current methods for explaining their predictions are resource-intensive. Most crucially, they require prohibitively large amounts of extra memory, since they rely on backpropagation which allocates almost twice as much GPU memory as the forward pass. This makes it difficult, if not impossible, to use them in production. We present AtMan that provides explanations of generative transformer models at almost no extra cost. Specifically, AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input with respect to the output prediction. Instead of using backpropagation, AtMan applies a parallelizable token-based search method based on cosine similarity neighborhood in the embedding space. Our exhaustive experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics while being computationally efficient. As such, AtMan is suitable for use in large model inference deployments.
    An adaptive safety layer with hard constraints for safe reinforcement learning in multi-energy management systems. (arXiv:2304.08897v3 [eess.SY] UPDATED)
    Safe reinforcement learning (RL) with hard constraint guarantees is a promising optimal control direction for multi-energy management systems. It only requires the environment-specific constraint functions itself a priori and not a complete model. The project-specific upfront and ongoing engineering efforts are therefore still reduced, better representations of the underlying system dynamics can still be learnt, and modelling bias is kept to a minimum. However, even the constraint functions alone are not always trivial to accurately provide in advance, leading to potentially unsafe behaviour. In this paper, we present two novel advancements: (I) combining the OptLayer and SafeFallback method, named OptLayerPolicy, to increase the initial utility while keeping a high sample efficiency and the possibility to formulate equality constraints. (II) introducing self-improving hard constraints, to increase the accuracy of the constraint functions as more and new data becomes available so that better policies can be learnt. Both advancements keep the constraint formulation decoupled from the RL formulation, so new (presumably better) RL algorithms can act as drop-in replacements. We have shown that, in a simulated multi-energy system case study, the initial utility is increased to 92.4% (OptLayerPolicy) compared to 86.1% (OptLayer) and that the policy after training is increased to 104.9% (GreyOptLayerPolicy) compared to 103.4% (OptLayer) - all relative to a vanilla RL benchmark. Although introducing surrogate functions into the optimisation problem requires special attention, we conclude that the newly presented GreyOptLayerPolicy method is the most advantageous.
    Gaussian Process Probes (GPP) for Uncertainty-Aware Probing. (arXiv:2305.18213v2 [cs.LG] UPDATED)
    Understanding which concepts models can and cannot represent has been fundamental to many tasks: from effective and responsible use of models to detecting out of distribution data. We introduce Gaussian process probes (GPP), a unified and simple framework for probing and measuring uncertainty about concepts represented by models. As a Bayesian extension of linear probing methods, GPP asks what kind of distribution over classifiers (of concepts) is induced by the model. This distribution can be used to measure both what the model represents and how confident the probe is about what the model represents. GPP can be applied to any pre-trained model with vector representations of inputs (e.g., activations). It does not require access to training data, gradients, or the architecture. We validate GPP on datasets containing both synthetic and real images. Our experiments show it can (1) probe a model's representations of concepts even with a very small number of examples, (2) accurately measure both epistemic uncertainty (how confident the probe is) and aleatory uncertainty (how fuzzy the concepts are to the model), and (3) detect out of distribution data using those uncertainty measures as well as classic methods do. By using Gaussian processes to expand what probing can offer, GPP provides a data-efficient, versatile and uncertainty-aware tool for understanding and evaluating the capabilities of machine learning models.  ( 3 min )
    Hierarchical Reinforcement Learning for Power Network Topology Control. (arXiv:2311.02129v1 [cs.LG])
    Learning in high-dimensional action spaces is a key challenge in applying reinforcement learning (RL) to real-world systems. In this paper, we study the possibility of controlling power networks using RL methods. Power networks are critical infrastructures that are complex to control. In particular, the combinatorial nature of the action space poses a challenge to both conventional optimizers and learned controllers. Hierarchical reinforcement learning (HRL) represents one approach to address this challenge. More precisely, a HRL framework for power network topology control is proposed. The HRL framework consists of three levels of action abstraction. At the highest level, there is the overall long-term task of power network operation, namely, keeping the power grid state within security constraints at all times, which is decomposed into two temporally extended actions: 'do nothing' versus 'propose a topology change'. At the intermediate level, the action space consists of all controllable substations. Finally, at the lowest level, the action space consists of all configurations of the chosen substation. By employing this HRL framework, several hierarchical power network agents are trained for the IEEE 14-bus network. Whereas at the highest level a purely rule-based policy is still chosen for all agents in this study, at the intermediate level the policy is trained using different state-of-the-art RL algorithms. At the lowest level, either an RL algorithm or a greedy algorithm is used. The performance of the different 3-level agents is compared with standard baseline (RL or greedy) approaches. A key finding is that the 3-level agent that employs RL both at the intermediate and the lowest level outperforms all other agents on the most difficult task. Our code is publicly available.
    Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data. (arXiv:2311.02216v1 [cs.CL])
    Numbers are crucial for various real-world domains such as finance, economics, and science. Thus, understanding and reasoning with numbers are essential skills for language models to solve different tasks. While different numerical benchmarks have been introduced in recent years, they are limited to specific numerical aspects mostly. In this paper, we propose a hierarchical taxonomy for numerical reasoning skills with more than ten reasoning types across four levels: representation, number sense, manipulation, and complex reasoning. We conduct a comprehensive evaluation of state-of-the-art models to identify reasoning challenges specific to them. Henceforth, we develop a diverse set of numerical probes employing a semi-automated approach. We focus on the tabular Natural Language Inference (TNLI) task as a case study and measure models' performance shifts. Our results show that no model consistently excels across all numerical reasoning types. Among the probed models, FlanT5 (few-/zero-shot) and GPT-3.5 (few-shot) demonstrate strong overall numerical reasoning skills compared to other models. Label-flipping probes indicate that models often exploit dataset artifacts to predict the correct labels.
    Automating Governing Knowledge Commons and Contextual Integrity (GKC-CI) Privacy Policy Annotations with Large Language Models. (arXiv:2311.02192v1 [cs.CY])
    Identifying contextual integrity (CI) and governing knowledge commons (GKC) parameters in privacy policy texts can facilitate normative privacy analysis. However, GKC-CI annotation has heretofore required manual or crowdsourced effort. This paper demonstrates that high-accuracy GKC-CI parameter annotation of privacy policies can be performed automatically using large language models. We fine-tune 18 open-source and proprietary models on 21,588 GKC-CI annotations from 16 ground truth privacy policies. Our best-performing model (fine-tuned GPT-3.5 Turbo with prompt engineering) has an accuracy of 86%, exceeding the performance of prior crowdsourcing approaches despite the complexity of privacy policy texts and the nuance of the GKC-CI annotation task. We apply our best-performing model to privacy policies from 164 popular online services, demonstrating the effectiveness of scaling GKC-CI annotation for data exploration. We make all annotated policies as well as the training data and scripts needed to fine-tune our best-performing model publicly available for future research.
    Structured Neural Networks for Density Estimation and Causal Inference. (arXiv:2311.02221v1 [cs.LG])
    Injecting structure into neural networks enables learning functions that satisfy invariances with respect to subsets of inputs. For instance, when learning generative models using neural networks, it is advantageous to encode the conditional independence structure of observed variables, often in the form of Bayesian networks. We propose the Structured Neural Network (StrNN), which injects structure through masking pathways in a neural network. The masks are designed via a novel relationship we explore between neural network architectures and binary matrix factorization, to ensure that the desired independencies are respected. We devise and study practical algorithms for this otherwise NP-hard design problem based on novel objectives that control the model architecture. We demonstrate the utility of StrNN in three applications: (1) binary and Gaussian density estimation with StrNN, (2) real-valued density estimation with Structured Autoregressive Flows (StrAFs) and Structured Continuous Normalizing Flows (StrCNF), and (3) interventional and counterfactual analysis with StrAFs for causal inference. Our work opens up new avenues for learning neural networks that enable data-efficient generative modeling and the use of normalizing flows for causal effect estimation.
    Explainable Authorship Identification in Cultural Heritage Applications: Analysis of a New Perspective. (arXiv:2311.02237v1 [cs.LG])
    While a substantial amount of work has recently been devoted to enhance the performance of computational Authorship Identification (AId) systems, little to no attention has been paid to endowing AId systems with the ability to explain the reasons behind their predictions. This lacking substantially hinders the practical employment of AId methodologies, since the predictions returned by such systems are hardly useful unless they are supported with suitable explanations. In this paper, we explore the applicability of existing general-purpose eXplainable Artificial Intelligence (XAI) techniques to AId, with a special focus on explanations addressed to scholars working in cultural heritage. In particular, we assess the relative merits of three different types of XAI techniques (feature ranking, probing, factuals and counterfactual selection) on three different AId tasks (authorship attribution, authorship verification, same-authorship verification) by running experiments on real AId data. Our analysis shows that, while these techniques make important first steps towards explainable Authorship Identification, more work remains to be done in order to provide tools that can be profitably integrated in the workflows of scholars.
    Joint Composite Latent Space Bayesian Optimization. (arXiv:2311.02213v1 [cs.LG])
    Bayesian Optimization (BO) is a technique for sample-efficient black-box optimization that employs probabilistic models to identify promising input locations for evaluation. When dealing with composite-structured functions, such as f=g o h, evaluating a specific location x yields observations of both the final outcome f(x) = g(h(x)) as well as the intermediate output(s) h(x). Previous research has shown that integrating information from these intermediate outputs can enhance BO performance substantially. However, existing methods struggle if the outputs h(x) are high-dimensional. Many relevant problems fall into this setting, including in the context of generative AI, molecular design, or robotics. To effectively tackle these challenges, we introduce Joint Composite Latent Space Bayesian Optimization (JoCo), a novel framework that jointly trains neural network encoders and probabilistic models to adaptively compress high-dimensional input and output spaces into manageable latent representations. This enables viable BO on these compressed representations, allowing JoCo to outperform other state-of-the-art methods in high-dimensional BO on a wide variety of simulated and real-world problems.
    Emergence of Abstract State Representations in Embodied Sequence Modeling. (arXiv:2311.02171v1 [cs.LG])
    Decision making via sequence modeling aims to mimic the success of language models, where actions taken by an embodied agent are modeled as tokens to predict. Despite their promising performance, it remains unclear if embodied sequence modeling leads to the emergence of internal representations that represent the environmental state information. A model that lacks abstract state representations would be liable to make decisions based on surface statistics which fail to generalize. We take the BabyAI environment, a grid world in which language-conditioned navigation tasks are performed, and build a sequence modeling Transformer, which takes a language instruction, a sequence of actions, and environmental observations as its inputs. In order to investigate the emergence of abstract state representations, we design a "blindfolded" navigation task, where only the initial environmental layout, the language instruction, and the action sequence to complete the task are available for training. Our probing results show that intermediate environmental layouts can be reasonably reconstructed from the internal activations of a trained model, and that language instructions play a role in the reconstruction accuracy. Our results suggest that many key features of state representations can emerge via embodied sequence modeling, supporting an optimistic outlook for applications of sequence modeling objectives to more complex embodied decision-making domains.
    AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation. (arXiv:2311.02194v1 [cs.LG])
    One of the main challenges in offline Reinforcement Learning (RL) is the distribution shift that arises from the learned policy deviating from the data collection policy. This is often addressed by avoiding out-of-distribution (OOD) actions during policy improvement as their presence can lead to substantial performance degradation. This challenge is amplified in the offline Multi-Agent RL (MARL) setting since the joint action space grows exponentially with the number of agents. To avoid this curse of dimensionality, existing MARL methods adopt either value decomposition methods or fully decentralized training of individual agents. However, even when combined with standard conservatism principles, these methods can still result in the selection of OOD joint actions in offline MARL. To this end, we introduce AlberDICE, an offline MARL algorithm that alternatively performs centralized training of individual agents based on stationary distribution optimization. AlberDICE circumvents the exponential complexity of MARL by computing the best response of one agent at a time while effectively avoiding OOD joint action selection. Theoretically, we show that the alternating optimization procedure converges to Nash policies. In the experiments, we demonstrate that AlberDICE significantly outperforms baseline algorithms on a standard suite of MARL benchmarks.
    Learning Time-Invariant Representations for Individual Neurons from Population Dynamics. (arXiv:2311.02258v1 [q-bio.NC])
    Neurons can display highly variable dynamics. While such variability presumably supports the wide range of behaviors generated by the organism, their gene expressions are relatively stable in the adult brain. This suggests that neuronal activity is a combination of its time-invariant identity and the inputs the neuron receives from the rest of the circuit. Here, we propose a self-supervised learning based method to assign time-invariant representations to individual neurons based on permutation-, and population size-invariant summary of population recordings. We fit dynamical models to neuronal activity to learn a representation by considering the activity of both the individual and the neighboring population. Our self-supervised approach and use of implicit representations enable robust inference against imperfections such as partial overlap of neurons across sessions, trial-to-trial variability, and limited availability of molecular (transcriptomic) labels for downstream supervised tasks. We demonstrate our method on a public multimodal dataset of mouse cortical neuronal activity and transcriptomic labels. We report > 35% improvement in predicting the transcriptomic subclass identity and > 20% improvement in predicting class identity with respect to the state-of-the-art.
    Heteroskedastic Tensor Clustering. (arXiv:2311.02306v1 [math.ST])
    Tensor clustering, which seeks to extract underlying cluster structures from noisy tensor observations, has gained increasing attention. One extensively studied model for tensor clustering is the tensor block model, which postulates the existence of clustering structures along each mode and has found broad applications in areas like multi-tissue gene expression analysis and multilayer network analysis. However, currently available computationally feasible methods for tensor clustering either are limited to handling i.i.d. sub-Gaussian noise or suffer from suboptimal statistical performance, which restrains their utility in applications that have to deal with heteroskedastic data and/or low signal-to-noise-ratio (SNR). To overcome these challenges, we propose a two-stage method, named $\mathsf{High\text{-}order~HeteroClustering}$ ($\mathsf{HHC}$), which starts by performing tensor subspace estimation via a novel spectral algorithm called $\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$, followed by approximate $k$-means to obtain cluster nodes. Encouragingly, our algorithm provably achieves exact clustering as long as the SNR exceeds the computational limit (ignoring logarithmic factors); here, the SNR refers to the ratio of the pairwise disparity between nodes to the noise level, and the computational limit indicates the lowest SNR that enables exact clustering with polynomial runtime. Comprehensive simulation and real-data experiments suggest that our algorithm outperforms existing algorithms across various settings, delivering more reliable clustering performance.
    Feature Attribution Explanations for Spiking Neural Networks. (arXiv:2311.02110v1 [cs.NE])
    Third-generation artificial neural networks, Spiking Neural Networks (SNNs), can be efficiently implemented on hardware. Their implementation on neuromorphic chips opens a broad range of applications, such as machine learning-based autonomous control and intelligent biomedical devices. In critical applications, however, insight into the reasoning of SNNs is important, thus SNNs need to be equipped with the ability to explain how decisions are reached. We present \textit{Temporal Spike Attribution} (TSA), a local explanation method for SNNs. To compute the explanation, we aggregate all information available in model-internal variables: spike times and model weights. We evaluate TSA on artificial and real-world time series data and measure explanation quality w.r.t. multiple quantitative criteria. We find that TSA correctly identifies a small subset of input features relevant to the decision (i.e., is output-complete and compact) and generates similar explanations for similar inputs (i.e., is continuous). Further, our experiments show that incorporating the notion of \emph{absent} spikes improves explanation quality. Our work can serve as a starting point for explainable SNNs, with future implementations on hardware yielding not only predictions but also explanations in a broad range of application scenarios. Source code is available at https://github.com/ElisaNguyen/tsa-explanations.
    A Systematic Review of Deep Graph Neural Networks: Challenges, Classification, Architectures, Applications & Potential Utility in Bioinformatics. (arXiv:2311.02127v1 [cs.LG])
    In recent years, tasks of machine learning ranging from image processing & audio/video analysis to natural language understanding have been transformed by deep learning. The data content in all these scenarios are expressed via Euclidean space. However, a considerable amount of application data is structured in non-Euclidean space and is expressed as graphs, e.g. dealing with complicated interactions & object interdependencies. Modelling physical systems, learning molecular signatures, identifying protein interactions and predicting diseases involve utilising a model that can adapt from graph data. Graph neural networks (GNNs), specified as artificial-neural models, employ message transmission between graph nodes to represent graph dependencies and are primarily used in the non-Euclidean domain. Variants of GNN like Graph Recurrent Networks (GRN), Graph Auto Encoder (GAE), Graph Convolution Networks (GCN), Graph Adversarial Methods & Graph Reinforcement learning have exhibited breakthrough productivity on a wide range of tasks, especially in the field of bioinformatics, in recent years as a result of the rapid collection of biological network data. Apart from presenting all existing GNN models, mathematical analysis and comparison of the variants of all types of GNN have been highlighted in this survey. Graph neural networks are investigated for their potential real-world applications in various fields, focusing on Bioinformatics. Furthermore, resources for evaluating graph neural network models and accessing open-source code & benchmark data sets are included. Ultimately, we provide some (seven) proposals for future research in this rapidly evolving domain. GNNs have the potential to be an excellent tool for solving a wide range of biological challenges in bioinformatics research, as they are best represented as connected complex graphs.
    A Comprehensive Study on Model Initialization Techniques Ensuring Efficient Federated Learning. (arXiv:2311.02100v1 [cs.LG])
    Advancement in the field of machine learning is unavoidable, but something of major concern is preserving the privacy of the users whose data is being used for training these machine learning algorithms. Federated learning(FL) has emerged as a promising paradigm for training machine learning models in a distributed and privacy-preserving manner which enables one to collaborate and train a global model without sharing local data. But starting this learning process on each device in the right way, called ``model initialization" is critical. The choice of initialization methods used for models plays a crucial role in the performance, convergence speed, communication efficiency, privacy guarantees of federated learning systems, etc. In this survey, we dive deeper into a comprehensive study of various ways of model initialization techniques in FL.Unlike other studies, our research meticulously compares, categorizes, and delineates the merits and demerits of each technique, examining their applicability across diverse FL scenarios. We highlight how factors like client variability, data non-IIDness, model caliber, security considerations, and network restrictions influence FL model outcomes and propose how strategic initialization can address and potentially rectify many such challenges. The motivation behind this survey is to highlight that the right start can help overcome challenges like varying data quality, security issues, and network problems. Our insights provide a foundational base for experts looking to fully utilize FL, also while understanding the complexities of model initialization.
    Efficient Machine Learning Ensemble Methods for Detecting Gravitational Wave Glitches in LIGO Time Series. (arXiv:2311.02106v1 [cs.LG])
    The phenomenon of Gravitational Wave (GW) analysis has grown in popularity as technology has advanced and the process of observing gravitational waves has become more precise. Although the sensitivity and the frequency of observation of GW signals are constantly improving, the possibility of noise in the collected GW data remains. In this paper, we propose two new Machine and Deep learning ensemble approaches (i.e., ShallowWaves and DeepWaves Ensembles) for detecting different types of noise and patterns in datasets from GW observatories. Our research also investigates various Machine and Deep Learning techniques for multi-class classification and provides a comprehensive benchmark, emphasizing the best results in terms of three commonly used performance metrics (i.e., accuracy, precision, and recall). We train and test our models on a dataset consisting of annotated time series from real-world data collected by the Advanced Laser Interferometer GW Observatory (LIGO). We empirically show that the best overall accuracy is obtained by the proposed DeepWaves Ensemble, followed close by the ShallowWaves Ensemble.
    Pairing-based graph neural network for simulating quantum materials. (arXiv:2311.02143v1 [cond-mat.str-el])
    We introduce a pairing-based graph neural network, $\textit{GemiNet}$, for simulating quantum many-body systems. Our architecture augments a BCS mean-field wavefunction with a generalized pair amplitude parameterized by a graph neural network. Variational Monte Carlo with GemiNet simultaneously provides an accurate, flexible, and scalable method for simulating many-electron systems. We apply GemiNet to two-dimensional semiconductor electron-hole bilayers and obtain highly accurate results on a variety of interaction-induced phases, including the exciton Bose-Einstein condensate, electron-hole superconductor, and bilayer Wigner crystal. Our study demonstrates the potential of physically-motivated neural network wavefunctions for quantum materials simulations.
    Combining Deep Learning on Order Books with Reinforcement Learning for Profitable Trading. (arXiv:2311.02088v1 [q-fin.CP])
    High-frequency trading is prevalent, where automated decisions must be made quickly to take advantage of price imbalances and patterns in price action that forecast near-future movements. While many algorithms have been explored and tested, analytical methods fail to harness the whole nature of the market environment by focusing on a limited domain. With the evergrowing machine learning field, many large-scale end-to-end studies on raw data have been successfully employed to increase the domain scope for profitable trading but are very difficult to replicate. Combining deep learning on the order books with reinforcement learning is one way of breaking down large-scale end-to-end learning into more manageable and lightweight components for reproducibility, suitable for retail trading. The following work focuses on forecasting returns across multiple horizons using order flow imbalance and training three temporal-difference learning models for five financial instruments to provide trading signals. The instruments used are two foreign exchange pairs (GBPUSD and EURUSD), two indices (DE40 and FTSE100), and one commodity (XAUUSD). The performances of these 15 agents are evaluated through backtesting simulation, and successful models proceed through to forward testing on a retail trading platform. The results prove potential but require further minimal modifications for consistently profitable trading to fully handle retail trading costs, slippage, and spread fluctuation.
    Client Orchestration and Cost-Efficient Joint Optimization for NOMA-Enabled Hierarchical Federated Learning. (arXiv:2311.02130v1 [cs.LG])
    Hierarchical federated learning (HFL) shows great advantages over conventional two-layer federated learning (FL) in reducing network overhead and interaction latency while still retaining the data privacy of distributed FL clients. However, the communication and energy overhead still pose a bottleneck for HFL performance, especially as the number of clients raises dramatically. To tackle this issue, we propose a non-orthogonal multiple access (NOMA) enabled HFL system under semi-synchronous cloud model aggregation in this paper, aiming to minimize the total cost of time and energy at each HFL global round. Specifically, we first propose a novel fuzzy logic based client orchestration policy considering client heterogenerity in multiple aspects, including channel quality, data quantity and model staleness. Subsequently, given the fuzzy based client-edge association, a joint edge server scheduling and resource allocation problem is formulated. Utilizing problem decomposition, we firstly derive the closed-form solution for the edge server scheduling subproblem via the penalty dual decomposition (PDD) method. Next, a deep deterministic policy gradient (DDPG) based algorithm is proposed to tackle the resource allocation subproblem considering time-varying environments. Finally, extensive simulations demonstrate that the proposed scheme outperforms the considered benchmarks regarding HFL performance improvement and total cost reduction.
    Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective. (arXiv:2305.15408v4 [cs.LG] UPDATED)
    Recent studies have discovered that Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs), particularly when dealing with complex tasks involving mathematics or reasoning. Despite the enormous empirical success, the underlying mechanisms behind CoT and how it unlocks the potential of LLMs remain elusive. In this paper, we take a first step towards theoretically answering these questions. Specifically, we examine the expressivity of LLMs with CoT in solving fundamental mathematical and decision-making problems. By using circuit complexity theory, we first give impossibility results showing that bounded-depth Transformers are unable to directly produce correct answers for basic arithmetic/equation tasks unless the model size grows super-polynomially with respect to the input length. In contrast, we then prove by construction that autoregressive Transformers of constant size suffice to solve both tasks by generating CoT derivations using a commonly used math language format. Moreover, we show LLMs with CoT can handle a general class of decision-making problems known as Dynamic Programming, thus justifying its power in tackling complex real-world tasks. Finally, an extensive set of experiments show that, while Transformers always fail to directly predict the answers, they can consistently learn to generate correct solutions step-by-step given sufficient CoT demonstrations.  ( 3 min )
    Transfer learning for atomistic simulations using GNNs and kernel mean embeddings. (arXiv:2306.01589v4 [cs.LG] UPDATED)
    Interatomic potentials learned using machine learning methods have been successfully applied to atomistic simulations. However, accurate models require large training datasets, while generating reference calculations is computationally demanding. To bypass this difficulty, we propose a transfer learning algorithm that leverages the ability of graph neural networks (GNNs) to represent chemical environments together with kernel mean embeddings. We extract a feature map from GNNs pre-trained on the OC20 dataset and use it to learn the potential energy surface from system-specific datasets of catalytic processes. Our method is further enhanced by incorporating into the kernel the chemical species information, resulting in improved performance and interpretability. We test our approach on a series of realistic datasets of increasing complexity, showing excellent generalization and transferability performance, and improving on methods that rely on GNNs or ridge regression alone, as well as similar fine-tuning approaches.  ( 2 min )
    Linear Oscillation: A Novel Activation Function for Vision Transformer. (arXiv:2308.13670v3 [cs.LG] UPDATED)
    Activation functions are the linchpins of deep learning, profoundly influencing both the representational capacity and training dynamics of neural networks. They shape not only the nature of representations but also optimize convergence rates and enhance generalization potential. Appreciating this critical role, we present the Linear Oscillation (LoC) activation function, defined as $f(x) = x \times \sin(\alpha x + \beta)$. Distinct from conventional activation functions which primarily introduce non-linearity, LoC seamlessly blends linear trajectories with oscillatory deviations. The nomenclature "Linear Oscillation" is a nod to its unique attribute of infusing linear activations with harmonious oscillations, capturing the essence of the "Importance of Confusion". This concept of "controlled confusion" within network activations is posited to foster more robust learning, particularly in contexts that necessitate discerning subtle patterns. Our empirical studies reveal that, when integrated into diverse neural architectures, the LoC activation function consistently outperforms established counterparts like ReLU and Sigmoid. The stellar performance exhibited by the avant-garde Vision Transformer model using LoC further validates its efficacy. This study illuminates the remarkable benefits of the LoC over other prominent activation functions. It champions the notion that intermittently introducing deliberate complexity or "confusion" during training can spur more profound and nuanced learning. This accentuates the pivotal role of judiciously selected activation functions in shaping the future of neural network training.
    WD3: Taming the Estimation Bias in Deep Reinforcement Learning. (arXiv:2006.12622v2 [cs.LG] UPDATED)
    The overestimation phenomenon caused by function approximation is a well-known issue in value-based reinforcement learning algorithms such as deep Q-networks and DDPG, which could lead to suboptimal policies. To address this issue, TD3 takes the minimum value between a pair of critics. In this paper, we show that the TD3 algorithm introduces underestimation bias in mild assumptions. To obtain a more precise estimation for value function, we unify these two opposites and propose a novel algorithm \underline{W}eighted \underline{D}elayed \underline{D}eep \underline{D}eterministic Policy Gradient (WD3), which can eliminate the estimation bias and further improve the performance by weighting a pair of critics. To demonstrate the effectiveness of WD3, we compare the learning process of value function between DDPG, TD3, and WD3. The results verify that our algorithm does eliminate the estimation error of value functions. Furthermore, we evaluate our algorithm on the continuous control tasks. We observe that in each test task, the performance of WD3 consistently outperforms, or at the very least matches, that of the state-of-the-art algorithms\footnote{Our code is available at~\href{https://sites.google.com/view/ictai20-wd3/}{https://sites.google.com/view/ictai20-wd3/}.}.
    Flamingo: Multi-Round Single-Server Secure Aggregation with Applications to Private Federated Learning. (arXiv:2308.09883v2 [cs.CR] UPDATED)
    This paper introduces Flamingo, a system for secure aggregation of data across a large set of clients. In secure aggregation, a server sums up the private inputs of clients and obtains the result without learning anything about the individual inputs beyond what is implied by the final sum. Flamingo focuses on the multi-round setting found in federated learning in which many consecutive summations (averages) of model weights are performed to derive a good model. Previous protocols, such as Bell et al. (CCS '20), have been designed for a single round and are adapted to the federated learning setting by repeating the protocol multiple times. Flamingo eliminates the need for the per-round setup of previous protocols, and has a new lightweight dropout resilience protocol to ensure that if clients leave in the middle of a sum the server can still obtain a meaningful result. Furthermore, Flamingo introduces a new way to locally choose the so-called client neighborhood introduced by Bell et al. These techniques help Flamingo reduce the number of interactions between clients and the server, resulting in a significant reduction in the end-to-end runtime for a full training session over prior work. We implement and evaluate Flamingo and show that it can securely train a neural network on the (Extended) MNIST and CIFAR-100 datasets, and the model converges without a loss in accuracy, compared to a non-private federated learning system.
    Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery. (arXiv:2310.19776v2 [cs.CV] UPDATED)
    In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality. Our code is available at https://github.com/SarahRastegar/InfoSieve.
    LLMs-augmented Contextual Bandit. (arXiv:2311.02268v1 [cs.LG])
    Contextual bandits have emerged as a cornerstone in reinforcement learning, enabling systems to make decisions with partial feedback. However, as contexts grow in complexity, traditional bandit algorithms can face challenges in adequately capturing and utilizing such contexts. In this paper, we propose a novel integration of large language models (LLMs) with the contextual bandit framework. By leveraging LLMs as an encoder, we enrich the representation of the context, providing the bandit with a denser and more informative view. Preliminary results on synthetic datasets demonstrate the potential of this approach, showing notable improvements in cumulative rewards and reductions in regret compared to traditional bandit algorithms. This integration not only showcases the capabilities of LLMs in reinforcement learning but also opens the door to a new era of contextually-aware decision systems.
    Design Of Rubble Analyzer Probe Using ML For Earthquake. (arXiv:2311.02087v1 [cs.SD])
    The earthquake rubble analyzer uses machine learning to detect human presence via ambient sounds, achieving 97.45% accuracy. It also provides real-time environmental data, aiding in assessing survival prospects for trapped individuals, crucial for post-earthquake rescue efforts
    Resist Label Noise with PGM for Graph Neural Networks. (arXiv:2311.02116v1 [cs.LG])
    While robust graph neural networks (GNNs) have been widely studied for graph perturbation and attack, those for label noise have received significantly less attention. Most existing methods heavily rely on the label smoothness assumption to correct noisy labels, which adversely affects their performance on heterophilous graphs. Further, they generally perform poorly in high noise-rate scenarios. To address these problems, in this paper, we propose a novel probabilistic graphical model (PGM) based framework LNP. Given a noisy label set and a clean label set, our goal is to maximize the likelihood of labels in the clean set. We first present LNP-v1, which generates clean labels based on graphs only in the Bayesian network. To further leverage the information of clean labels in the noisy label set, we put forward LNP-v2, which incorporates the noisy label set into the Bayesian network to generate clean labels. The generative process can then be used to predict labels for unlabeled nodes. We conduct extensive experiments to show the robustness of LNP on varying noise types and rates, and also on graphs with different heterophilies. In particular, we show that LNP can lead to inspiring performance in high noise-rate situations.
    Sliced Denoising: A Physics-Informed Molecular Pre-Training Method. (arXiv:2311.02124v1 [q-bio.BM])
    While molecular pre-training has shown great potential in enhancing drug discovery, the lack of a solid physical interpretation in current methods raises concerns about whether the learned representation truly captures the underlying explanatory factors in observed data, ultimately resulting in limited generalization and robustness. Although denoising methods offer a physical interpretation, their accuracy is often compromised by ad-hoc noise design, leading to inaccurate learned force fields. To address this limitation, this paper proposes a new method for molecular pre-training, called sliced denoising (SliDe), which is based on the classical mechanical intramolecular potential theory. SliDe utilizes a novel noise strategy that perturbs bond lengths, angles, and torsion angles to achieve better sampling over conformations. Additionally, it introduces a random slicing approach that circumvents the computationally expensive calculation of the Jacobian matrix, which is otherwise essential for estimating the force field. By aligning with physical principles, SliDe shows a 42\% improvement in the accuracy of estimated force fields compared to current state-of-the-art denoising methods, and thus outperforms traditional baselines on various molecular property prediction tasks.
    Solving MaxSAT with Matrix Multiplication. (arXiv:2311.02101v1 [cs.AI])
    We propose an incomplete algorithm for Maximum Satisfiability (MaxSAT) specifically designed to run on neural network accelerators such as GPUs and TPUs. Given a MaxSAT problem instance in conjunctive normal form, our procedure constructs a Restricted Boltzmann Machine (RBM) with an equilibrium distribution wherein the probability of a Boolean assignment is exponential in the number of clauses it satisfies. Block Gibbs sampling is used to stochastically search the space of assignments with parallel Markov chains. Since matrix multiplication is the main computational primitive for block Gibbs sampling in an RBM, our approach leads to an elegantly simple algorithm (40 lines of JAX) well-suited for neural network accelerators. Theoretical results about RBMs guarantee that the required number of visible and hidden units of the RBM scale only linearly with the number of variables and constant-sized clauses in the MaxSAT instance, ensuring that the computational cost of a Gibbs step scales reasonably with the instance size. Search throughput can be increased by batching parallel chains within a single accelerator as well as by distributing them across multiple accelerators. As a further enhancement, a heuristic based on unit propagation running on CPU is periodically applied to the sampled assignments. Our approach, which we term RbmSAT, is a new design point in the algorithm-hardware co-design space for MaxSAT. We present timed results on a subset of problem instances from the annual MaxSAT Evaluation's Incomplete Unweighted Track for the years 2018 to 2021. When allotted the same running time and CPU compute budget (but no TPUs), RbmSAT outperforms other participating solvers on problems drawn from three out of the four years' competitions. Given the same running time on a TPU cluster for which RbmSAT is uniquely designed, it outperforms all solvers on problems drawn from all four years.
    Embodied Lifelong Learning for Task and Motion Planning. (arXiv:2307.06870v2 [cs.RO] UPDATED)
    A robot deployed in a home over long stretches of time faces a true lifelong learning problem. As it seeks to provide assistance to its users, the robot should leverage any accumulated experience to improve its own knowledge and proficiency. We formalize this setting with a novel formulation of lifelong learning for task and motion planning (TAMP), which endows our learner with the compositionality of TAMP systems. Exploiting the modularity of TAMP, we develop a mixture of generative models that produces candidate continuous parameters for a planner. Whereas most existing lifelong learning approaches determine a priori how data is shared across various models, our approach learns shared and non-shared models and determines which to use online during planning based on auxiliary tasks that serve as a proxy for each model's understanding of a state. Our method exhibits substantial improvements (over time and compared to baselines) in planning success on 2D and BEHAVIOR domains.  ( 2 min )
    Efficient Symbolic Policy Learning with Differentiable Symbolic Expression. (arXiv:2311.02104v1 [cs.LG])
    Deep reinforcement learning (DRL) has led to a wide range of advances in sequential decision-making tasks. However, the complexity of neural network policies makes it difficult to understand and deploy with limited computational resources. Currently, employing compact symbolic expressions as symbolic policies is a promising strategy to obtain simple and interpretable policies. Previous symbolic policy methods usually involve complex training processes and pre-trained neural network policies, which are inefficient and limit the application of symbolic policies. In this paper, we propose an efficient gradient-based learning method named Efficient Symbolic Policy Learning (ESPL) that learns the symbolic policy from scratch in an end-to-end way. We introduce a symbolic network as the search space and employ a path selector to find the compact symbolic policy. By doing so we represent the policy with a differentiable symbolic expression and train it in an off-policy manner which further improves the efficiency. In addition, in contrast with previous symbolic policies which only work in single-task RL because of complexity, we expand ESPL on meta-RL to generate symbolic policies for unseen tasks. Experimentally, we show that our approach generates symbolic policies with higher performance and greatly improves data efficiency for single-task RL. In meta-RL, we demonstrate that compared with neural network policies the proposed symbolic policy achieves higher performance and efficiency and shows the potential to be interpretable.
    FLSL: Feature-level Self-supervised Learning. (arXiv:2306.06203v4 [cs.LG] UPDATED)
    Current self-supervised learning (SSL) methods (e.g., SimCLR, DINO, VICReg,MOCOv3) target primarily on representations at instance level and do not generalize well to dense prediction tasks, such as object detection and segmentation.Towards aligning SSL with dense predictions, this paper demonstrates for the first time the underlying mean-shift clustering process of Vision Transformers (ViT), which aligns well with natural image semantics (e.g., a world of objects and stuffs). By employing transformer for joint embedding and clustering, we propose a two-level feature clustering SSL method, coined Feature-Level Self-supervised Learning (FLSL). We present the formal definition of the FLSL problem and construct the objectives from the mean-shift and k-means perspectives. We show that FLSL promotes remarkable semantic cluster representations and learns an embedding scheme amenable to intra-view and inter-view feature clustering. Experiments show that FLSL yields significant improvements in dense prediction tasks, achieving 44.9 (+2.8)% AP and 46.5% AP in object detection, as well as 40.8 (+2.3)% AP and 42.1% AP in instance segmentation on MS-COCO, using Mask R-CNN with ViT-S/16 and ViT-S/8 as backbone, respectively. FLSL consistently outperforms existing SSL methods across additional benchmarks, including UAV17 object detection on UAVDT, and video instance segmentation on DAVIS 2017.We conclude by presenting visualization and various ablation studies to better understand the success of FLSL. The source code is available at https://github.com/ISL-CV/FLSL.
    DiffLoad: Uncertainty Quantification in Load Forecasting with Diffusion Model. (arXiv:2306.01001v2 [cs.LG] UPDATED)
    Electrical load forecasting plays a crucial role in decision-making for power systems, including unit commitment and economic dispatch. The integration of renewable energy sources and the occurrence of external events, such as the COVID-19 pandemic, have rapidly increased uncertainties in load forecasting. The uncertainties in load forecasting can be divided into two types: epistemic uncertainty and aleatoric uncertainty. Separating these types of uncertainties can help decision-makers better understand where and to what extent the uncertainty is, thereby enhancing their confidence in the following decision-making. This paper proposes a diffusion-based Seq2Seq structure to estimate epistemic uncertainty and employs the robust additive Cauchy distribution to estimate aleatoric uncertainty. Our method not only ensures the accuracy of load forecasting but also demonstrates the ability to separate the two types of uncertainties and be applicable to different levels of loads. The relevant code can be found at \url{https://anonymous.4open.science/r/DiffLoad-4714/}.  ( 2 min )
    CAT-Walk: Inductive Hypergraph Learning via Set Walks. (arXiv:2306.11147v2 [cs.LG] UPDATED)
    Temporal hypergraphs provide a powerful paradigm for modeling time-dependent, higher-order interactions in complex systems. Representation learning for hypergraphs is essential for extracting patterns of the higher-order interactions that are critically important in real-world problems in social network analysis, neuroscience, finance, etc. However, existing methods are typically designed only for specific tasks or static hypergraphs. We present CAT-Walk, an inductive method that learns the underlying dynamic laws that govern the temporal and structural processes underlying a temporal hypergraph. CAT-Walk introduces a temporal, higher-order walk on hypergraphs, SetWalk, that extracts higher-order causal patterns. CAT-Walk uses a novel adaptive and permutation invariant pooling strategy, SetMixer, along with a set-based anonymization process that hides the identity of hyperedges. Finally, we present a simple yet effective neural network model to encode hyperedges. Our evaluation on 10 hypergraph benchmark datasets shows that CAT-Walk attains outstanding performance on temporal hyperedge prediction benchmarks in both inductive and transductive settings. It also shows competitive performance with state-of-the-art methods for node classification. (https://github.com/ubc-systopia/CATWalk)  ( 2 min )
    Benchmarking Foundation Models with Language-Model-as-an-Examiner. (arXiv:2306.04181v2 [cs.CL] UPDATED)
    Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets, however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility as various LMs can be adopted as the examiner, and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise three strategies: (1) We instruct the LM examiner to generate questions across a multitude of domains to probe for a broad acquisition, and raise follow-up questions to engage in a more in-depth assessment. (2) Upon evaluation, the examiner combines both scoring and ranking measurements, providing a reliable result as it aligns closely with human annotations. (3) We additionally propose a decentralized Peer-examination method to address the biases in a single examiner. Our data and benchmarking results are available at: this http URL  ( 2 min )
    Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation. (arXiv:2305.05803v4 [cs.CV] UPDATED)
    Weakly supervised semantic segmentation (WSSS) aims to bypass the need for laborious pixel-level annotation by using only image-level annotation. Most existing methods rely on Class Activation Maps (CAM) to derive pixel-level pseudo-labels and use them to train a fully supervised semantic segmentation model. Although these pseudo-labels are class-aware, indicating the coarse regions for particular classes, they are not object-aware and fail to delineate accurate object boundaries. To address this, we introduce a simple yet effective method harnessing the Segment Anything Model (SAM), a class-agnostic foundation model capable of producing fine-grained instance masks of objects, parts, and subparts. We use CAM pseudo-labels as cues to select and combine SAM masks, resulting in high-quality pseudo-labels that are both class-aware and object-aware. Our approach is highly versatile and can be easily integrated into existing WSSS methods without any modification. Despite its simplicity, our approach shows consistent gain over the state-of-the-art WSSS methods on both PASCAL VOC and MS-COCO datasets.  ( 2 min )
    GAD-NR: Graph Anomaly Detection via Neighborhood Reconstruction. (arXiv:2306.01951v5 [cs.LG] UPDATED)
    Graph Anomaly Detection (GAD) is a technique used to identify abnormal nodes within graphs, finding applications in network security, fraud detection, social media spam detection, and various other domains. A common method for GAD is Graph Auto-Encoders (GAEs), which encode graph data into node representations and identify anomalies by assessing the reconstruction quality of the graphs based on these representations. However, existing GAE models are primarily optimized for direct link reconstruction, resulting in nodes connected in the graph being clustered in the latent space. As a result, they excel at detecting cluster-type structural anomalies but struggle with more complex structural anomalies that do not conform to clusters. To address this limitation, we propose a novel solution called GAD-NR, a new variant of GAE that incorporates neighborhood reconstruction for graph anomaly detection. GAD-NR aims to reconstruct the entire neighborhood of a node, encompassing the local structure, self-attributes, and neighbor attributes, based on the corresponding node representation. By comparing the neighborhood reconstruction loss between anomalous nodes and normal nodes, GAD-NR can effectively detect any anomalies. Extensive experimentation conducted on six real-world datasets validates the effectiveness of GAD-NR, showcasing significant improvements (by up to 30% in AUC) over state-of-the-art competitors. The source code for GAD-NR is openly available. Importantly, the comparative analysis reveals that the existing methods perform well only in detecting one or two types of anomalies out of the three types studied. In contrast, GAD-NR excels at detecting all three types of anomalies across the datasets, demonstrating its comprehensive anomaly detection capabilities.
    C-STS: Conditional Semantic Textual Similarity. (arXiv:2305.15093v2 [cs.CL] UPDATED)
    Semantic textual similarity (STS), a cornerstone task in NLP, measures the degree of similarity between a pair of sentences, and has broad application in fields such as information retrieval and natural language understanding. However, sentence similarity can be inherently ambiguous, depending on the specific aspect of interest. We resolve this ambiguity by proposing a novel task called Conditional STS (C-STS) which measures sentences' similarity conditioned on an feature described in natural language (hereon, condition). As an example, the similarity between the sentences "The NBA player shoots a three-pointer." and "A man throws a tennis ball into the air to serve." is higher for the condition "The motion of the ball" (both upward) and lower for "The size of the ball" (one large and one small). C-STS's advantages are two-fold: (1) it reduces the subjectivity and ambiguity of STS and (2) enables fine-grained language model evaluation through diverse natural language conditions. We put several state-of-the-art models to the test, and even those performing well on STS (e.g. SimCSE, Flan-T5, and GPT-4) find C-STS challenging; all with Spearman correlation scores below 50. To encourage a more comprehensive evaluation of semantic similarity and natural language understanding, we make nearly 19K C-STS examples and code available for others to train and test their models.
    Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model. (arXiv:2306.05720v2 [cs.CV] UPDATED)
    Latent diffusion models (LDMs) exhibit an impressive ability to produce realistic images, yet the inner workings of these models remain mysterious. Even when trained purely on images without explicit depth information, they typically output coherent pictures of 3D scenes. In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction. These representations appear surprisingly early in the denoising process$-$well before a human can easily make sense of the noisy images. Intervention experiments further indicate these representations play a causal role in image synthesis, and may be used for simple high-level editing of an LDM's output. Project page: https://yc015.github.io/scene-representation-diffusion-model/
    Continually Improving Extractive QA via Human Feedback. (arXiv:2305.12473v2 [cs.CL] UPDATED)
    We study continually improving an extractive question answering (QA) system via human user feedback. We design and deploy an iterative approach, where information-seeking users ask questions, receive model-predicted answers, and provide feedback. We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time. Our experiments show effective improvement from user feedback of extractive QA models over time across different data regimes, including significant potential for domain adaptation.
    Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning. (arXiv:2310.20587v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets. In real-world scenarios, data collection could be costly and risky; therefore, offline RL becomes particularly challenging when the in-domain data is limited. Given recent advances in Large Language Models (LLMs) and their few-shot learning prowess, this paper introduces $\textbf{La}$nguage Models for $\textbf{Mo}$tion Control ($\textbf{LaMo}$), a general framework based on Decision Transformers to effectively use pre-trained Language Models (LMs) for offline RL. Our framework highlights four crucial components: (1) Initializing Decision Transformers with sequentially pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to full-weight fine-tuning, to combine the pre-trained knowledge from LMs and in-domain knowledge effectively, (3) using the non-linear MLP transformation instead of linear projections, to generate embeddings, and (4) integrating an auxiliary language prediction loss during fine-tuning to stabilize the LMs and retain their original abilities on languages. Empirical results indicate $\textbf{LaMo}$ achieves state-of-the-art performance in sparse-reward tasks and closes the gap between value-based offline RL methods and decision transformers in dense-reward tasks. In particular, our method demonstrates superior performance in scenarios with limited data samples. Our project website is $\href{https://lamo2023.github.io}{\text{this https URL}}$.
    Differentially Private Federated Clustering over Non-IID Data. (arXiv:2301.00955v3 [cs.DC] CROSS LISTED)
    In this paper, we investigate federated clustering (FedC) problem, that aims to accurately partition unlabeled data samples distributed over massive clients into finite clusters under the orchestration of a parameter server, meanwhile considering data privacy. Though it is an NP-hard optimization problem involving real variables denoting cluster centroids and binary variables denoting the cluster membership of each data sample, we judiciously reformulate the FedC problem into a non-convex optimization problem with only one convex constraint, accordingly yielding a soft clustering solution. Then a novel FedC algorithm using differential privacy (DP) technique, referred to as DP-FedC, is proposed in which partial clients participation and multiple local model updating steps are also considered. Furthermore, various attributes of the proposed DP-FedC are obtained through theoretical analyses of privacy protection and convergence rate, especially for the case of non-identically and independently distributed (non-i.i.d.) data, that ideally serve as the guidelines for the design of the proposed DP-FedC. Then some experimental results on two real datasets are provided to demonstrate the efficacy of the proposed DP-FedC together with its much superior performance over some state-of-the-art FedC algorithms, and the consistency with all the presented analytical results.
    Tight conditions for when the NTK approximation is valid. (arXiv:2305.13141v3 [cs.LG] UPDATED)
    We study when the neural tangent kernel (NTK) approximation is valid for training a model with the square loss. In the lazy training setting of Chizat et al. 2019, we show that rescaling the model by a factor of $\alpha = O(T)$ suffices for the NTK approximation to be valid until training time $T$. Our bound is tight and improves on the previous bound of Chizat et al. 2019, which required a larger rescaling factor of $\alpha = O(T^2)$.
    Reduce Computational Complexity for Convolutional Layers by Skipping Zeros. (arXiv:2306.15951v3 [cs.LG] UPDATED)
    Convolutional neural networks necessitate good algorithms to reduce complexity, and sufficient utilization of parallel processors for acceleration. Within convolutional layers, there are three types of operators: convolution used in forward propagation, deconvolution and dilated-convolution utilized in backward propagation. During the execution of these operators, zeros are typically added to tensors, leading to redundant calculations and unnecessary strain on hardware. To circumvent these inefficiencies, we propose the C-K-S algorithm, accompanied by efficient GPU implementations. C-K-S trims filters to exclude zero-padding. For deconvolution and dilated-convolution, C-K-S transforms sparse tensors into dense tensors, and standardizes the local computational rules to simplify the hardware control. The experimental results demonstrate that C-K-S offers good performance in terms of speed and convergence, surpassing the capabilities of PyTorch and cuDNN in certain scenarios.  ( 2 min )
    Moment Matching Denoising Gibbs Sampling. (arXiv:2305.11650v2 [stat.ML] UPDATED)
    Energy-Based Models (EBMs) offer a versatile framework for modeling complex data distributions. However, training and sampling from EBMs continue to pose significant challenges. The widely-used Denoising Score Matching (DSM) method for scalable EBM training suffers from inconsistency issues, causing the energy model to learn a `noisy' data distribution. In this work, we propose an efficient sampling framework: (pseudo)-Gibbs sampling with moment matching, which enables effective sampling from the underlying clean model when given a `noisy' model that has been well-trained via DSM. We explore the benefits of our approach compared to related methods and demonstrate how to scale the method to high-dimensional datasets.
    Bridging RL Theory and Practice with the Effective Horizon. (arXiv:2304.09853v2 [cs.LG] UPDATED)
    Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We choose to focus on deterministic environments because they share many interesting properties of stochastic environments, but are easier to analyze. Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e. when it is optimal to be greedy on the random policy's Q function), deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon
    Multilevel Diffusion: Infinite Dimensional Score-Based Diffusion Models for Image Generation. (arXiv:2303.04772v3 [cs.LG] UPDATED)
    Score-based diffusion models (SBDM) have recently emerged as state-of-the-art approaches for image generation. Existing SBDMs are typically formulated in a finite-dimensional setting, where images are considered as tensors of finite size. This paper develops SBDMs in the infinite-dimensional setting, that is, we model the training data as functions supported on a rectangular domain. Besides the quest for generating images at ever higher resolution, our primary motivation is to create a well-posed infinite-dimensional learning problem so that we can discretize it consistently on multiple resolution levels. We thereby intend to obtain diffusion models that generalize across different resolution levels and improve the efficiency of the training process. We demonstrate how to overcome two shortcomings of current SBDM approaches in the infinite-dimensional setting. First, we modify the forward process to ensure that the latent distribution is well-defined in the infinite-dimensional setting using the notion of trace class operators. We derive the reverse processes for finite approximations. Second, we illustrate that approximating the score function with an operator network is beneficial for multilevel training. After deriving the convergence of the discretization and the approximation of multilevel training, we implement an infinite-dimensional SBDM approach and show the first promising results on MNIST and Fashion-MNIST, underlining our developed theory.
    Detecting Language Model Attacks with Perplexity. (arXiv:2308.14132v2 [cs.CL] UPDATED)
    A novel hack involving Large Language Models (LLMs) has emerged, leveraging adversarial suffixes to trick models into generating perilous responses. This method has garnered considerable attention from reputable media outlets such as the New York Times and Wired, thereby influencing public perception regarding the security and safety of LLMs. In this study, we advocate the utilization of perplexity as one of the means to recognize such potential attacks. The underlying concept behind these hacks revolves around appending an unusually constructed string of text to a harmful query that would otherwise be blocked. This maneuver confuses the protective mechanisms and tricks the model into generating a forbidden response. Such scenarios could result in providing detailed instructions to a malicious user for constructing explosives or orchestrating a bank heist. Our investigation demonstrates the feasibility of employing perplexity, a prevalent natural language processing metric, to detect these adversarial tactics before generating a forbidden response. By evaluating the perplexity of queries with and without such adversarial suffixes using an open-source LLM, we discovered that nearly 90 percent were above a perplexity of 1000. This contrast underscores the efficacy of perplexity for detecting this type of exploit.
    An Operator Learning Framework for Spatiotemporal Super-resolution of Scientific Simulations. (arXiv:2311.02328v1 [cs.LG])
    In numerous contexts, high-resolution solutions to partial differential equations are required to capture faithfully essential dynamics which occur at small spatiotemporal scales, but these solutions can be very difficult and slow to obtain using traditional methods due to limited computational resources. A recent direction to circumvent these computational limitations is to use machine learning techniques for super-resolution, to reconstruct high-resolution numerical solutions from low-resolution simulations which can be obtained more efficiently. The proposed approach, the Super Resolution Operator Network (SROpNet), frames super-resolution as an operator learning problem and draws inspiration from existing architectures to learn continuous representations of solutions to parametric differential equations from low-resolution approximations, which can then be evaluated at any desired location. In addition, no restrictions are imposed on the locations of (the fixed number of) spatiotemporal sensors at which the low-resolution approximations are provided, thereby enabling the consideration of a broader spectrum of problems arising in practice, for which many existing super-resolution approaches are not well-suited.
    Data-Dependent Bounds for Online Portfolio Selection Without Lipschitzness and Smoothness. (arXiv:2305.13946v2 [cs.LG] UPDATED)
    This work introduces the first small-loss and gradual-variation regret bounds for online portfolio selection, marking the first instances of data-dependent bounds for online convex optimization with non-Lipschitz, non-smooth losses. The algorithms we propose exhibit sublinear regret rates in the worst cases and achieve logarithmic regrets when the data is "easy," with per-iteration time almost linear in the number of investment alternatives. The regret bounds are derived using novel smoothness characterizations of the logarithmic loss, a local norm-based analysis of following the regularized leader (FTRL) with self-concordant regularizers, which are not necessarily barriers, and an implicit variant of optimistic FTRL with the log-barrier.  ( 2 min )
    Multi-label Classification with High-rank and High-order Label Correlations. (arXiv:2207.04197v2 [cs.LG] UPDATED)
    Exploiting label correlations is important to multi-label classification. Previous methods capture the high-order label correlations mainly by transforming the label matrix to a latent label space with low-rank matrix factorization. However, the label matrix is generally a full-rank or approximate full-rank matrix, making the low-rank factorization inappropriate. Besides, in the latent space, the label correlations will become implicit. To this end, we propose a simple yet effective method to depict the high-order label correlations explicitly, and at the same time maintain the high-rank of the label matrix. Moreover, we estimate the label correlations and infer model parameters simultaneously via the local geometric structure of the input to achieve mutual enhancement. Comparative studies over twelve benchmark data sets validate the effectiveness of the proposed algorithm in multi-label classification. The exploited high-order label correlations are consistent with common sense empirically. Our code is publicly available at https://github.com/Chongjie-Si/HOMI.
    Transfer-Learning Across Datasets with Different Input Dimensions: An Algorithm and Analysis for the Linear Regression Case. (arXiv:2202.05069v4 [stat.ML] UPDATED)
    With the development of new sensors and monitoring devices, more sources of data become available to be used as inputs for machine learning models. These can on the one hand help to improve the accuracy of a model. On the other hand, combining these new inputs with historical data remains a challenge that has not yet been studied in enough detail. In this work, we propose a transfer learning algorithm that combines new and historical data with different input dimensions. This approach is easy to implement, efficient, with computational complexity equivalent to the ordinary least-squares method, and requires no hyperparameter tuning, making it straightforward to apply when the new data is limited. Different from other approaches, we provide a rigorous theoretical study of its robustness, showing that it cannot be outperformed by a baseline that utilizes only the new data. Our approach achieves state-of-the-art performance on 9 real-life datasets, outperforming the linear DSFT, another linear transfer learning algorithm, and performing comparably to non-linear DSFT.
    Data pruning and neural scaling laws: fundamental limitations of score-based algorithms. (arXiv:2302.06960v3 [stat.ML] UPDATED)
    Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of $30\%$ or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving the so-called neural scaling laws; in [Sorscher et al.], the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate ``No Free Lunch" theorems for data pruning and present calibration protocols that enhance the performance of existing pruning algorithms in this high compression regime using randomization.
    VillanDiffusion: A Unified Backdoor Attack Framework for Diffusion Models. (arXiv:2306.06874v3 [cs.CR] UPDATED)
    Diffusion Models (DMs) are state-of-the-art generative models that learn a reversible corruption process from iterative noise addition and denoising. They are the backbone of many generative AI applications, such as text-to-image conditional generation. However, recent studies have shown that basic unconditional DMs (e.g., DDPM and DDIM) are vulnerable to backdoor injection, a type of output manipulation attack triggered by a maliciously embedded pattern at model input. This paper presents a unified backdoor attack framework (VillanDiffusion) to expand the current scope of backdoor analysis for DMs. Our framework covers mainstream unconditional and conditional DMs (denoising-based and score-based) and various training-free samplers for holistic evaluations. Experiments show that our unified framework facilitates the backdoor analysis of different DM configurations and provides new insights into caption-based backdoor attacks on DMs. Our code is available on GitHub: \url{https://github.com/IBM/villandiffusion}  ( 2 min )
    A Novel Site-Agnostic Multimodal Deep Learning Model to Identify Pro-Eating Disorder Content on Social Media. (arXiv:2307.06775v4 [cs.LG] UPDATED)
    Over the last decade, there has been a vast increase in eating disorder diagnoses and eating disorder-attributed deaths, reaching their zenith during the Covid-19 pandemic. This immense growth derived in part from the stressors of the pandemic but also from increased exposure to social media, which is rife with content that promotes eating disorders. This study aimed to create a multimodal deep learning model that can determine if a given social media post promotes eating disorders based on a combination of visual and textual data. A labeled dataset of Tweets was collected from Twitter, recently rebranded as X, upon which twelve deep learning models were trained and evaluated. Based on model performance, the most effective deep learning model was the multimodal fusion of the RoBERTa natural language processing model and the MaxViT image classification model, attaining accuracy and F1 scores of 95.9% and 0.959, respectively. The RoBERTa and MaxViT fusion model, deployed to classify an unlabeled dataset of posts from the social media sites Tumblr and Reddit, generated results akin to those of previous research studies that did not employ artificial intelligence-based techniques, indicating that deep learning models can develop insights congruent to those of researchers. Additionally, the model was used to conduct a time-series analysis of yet unseen Tweets from eight Twitter hashtags, uncovering that, since 2014, the relative abundance of content that promotes eating disorders has decreased drastically within those communities. Despite this reduction, by 2018, content that promotes eating disorders had either stopped declining or increased in ampleness anew on those hashtags.  ( 3 min )
    Deep Learning with Kernels through RKHM and the Perron-Frobenius Operator. (arXiv:2305.13588v2 [stat.ML] UPDATED)
    Reproducing kernel Hilbert $C^*$-module (RKHM) is a generalization of reproducing kernel Hilbert space (RKHS) by means of $C^*$-algebra, and the Perron-Frobenius operator is a linear operator related to the composition of functions. Combining these two concepts, we present deep RKHM, a deep learning framework for kernel methods. We derive a new Rademacher generalization bound in this setting and provide a theoretical interpretation of benign overfitting by means of Perron-Frobenius operators. By virtue of $C^*$-algebra, the dependency of the bound on output dimension is milder than existing bounds. We show that $C^*$-algebra is a suitable tool for deep learning with kernels, enabling us to take advantage of the product structure of operators and to provide a clear connection with convolutional neural networks. Our theoretical analysis provides a new lens through which one can design and analyze deep kernel methods.
    Bi-directional Training for Composed Image Retrieval via Text Prompt Learning. (arXiv:2303.16604v2 [cs.CV] UPDATED)
    Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text describing the desired changes. Existing approaches to solving this challenging task learn a mapping from the (reference image, modification text)-pair to an image embedding that is then matched against a large image corpus. One area that has not yet been explored is the reverse direction, which asks the question, what reference image when modified as described by the text would produce the given target image? In this work we propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures with minimum changes, which improves the performance of the model. To encode the bi-directional query we prepend a learnable token to the modification text that designates the direction of the query and then finetune the parameters of the text embedding module. We make no other changes to the network architecture. Experiments on two standard datasets show that our novel approach achieves improved performance over a baseline BLIP-based model that itself already achieves competitive performance. Our code is released at https://github.com/Cuberick-Orion/Bi-Blip4CIR.
    Is RLHF More Difficult than Standard RL?. (arXiv:2306.14111v2 [cs.LG] UPDATED)
    Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games with a restricted set of policies. The latter case can be further reduced to adversarial MDP when preferences only depend on the final state. We instantiate all reward-based RL subroutines by concrete provable algorithms, and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees when K-wise comparisons are available.  ( 2 min )
    BatmanNet: Bi-branch Masked Graph Transformer Autoencoder for Molecular Representation. (arXiv:2211.13979v3 [cs.LG] UPDATED)
    Although substantial efforts have been made using graph neural networks (GNNs) for AI-driven drug discovery (AIDD), effective molecular representation learning remains an open challenge, especially in the case of insufficient labeled molecules. Recent studies suggest that big GNN models pre-trained by self-supervised learning on unlabeled datasets enable better transfer performance in downstream molecular property prediction tasks. However, the approaches in these studies require multiple complex self-supervised tasks and large-scale datasets, which are time-consuming, computationally expensive, and difficult to pre-train end-to-end. Here, we design a simple yet effective self-supervised strategy to simultaneously learn local and global information about molecules, and further propose a novel bi-branch masked graph transformer autoencoder (BatmanNet) to learn molecular representations. BatmanNet features two tailored complementary and asymmetric graph autoencoders to reconstruct the missing nodes and edges, respectively, from a masked molecular graph. With this design, BatmanNet can effectively capture the underlying structure and semantic information of molecules, thus improving the performance of molecular representation. BatmanNet achieves state-of-the-art results for multiple drug discovery tasks, including molecular properties prediction, drug-drug interaction, and drug-target interaction, on 13 benchmark datasets, demonstrating its great potential and superiority in molecular representation learning.
    Using multimodal learning and deep generative models for corporate bankruptcy prediction. (arXiv:2211.08405v4 [q-fin.RM] UPDATED)
    Textual data from financial filings, e.g., the Management's Discussion \& Analysis (MDA) section in Form 10-K, has been used to improve the prediction accuracy of bankruptcy models. In practice, however, we cannot obtain the MDA section for all public companies. The two main reasons for the lack of MDA are: (i) not all companies are obliged to submit the MDA and (ii) technical problems arise when crawling and scrapping the MDA section. This research introduces for the first time, to the best of our knowledge, the concept of multimodal learning in bankruptcy prediction models to solve the problem that for some companies we are unable to obtain the MDA text. We use the Conditional Multimodal Discriminative (CMMD) model to learn multimodal representations that embed information from accounting, market, and textual modalities. The CMMD model needs a sample with all data modalities for model training. At test time, the CMMD model only needs access to accounting and market modalities to generate multimodal representations, which are further used to make bankruptcy predictions. This fact makes the use of bankruptcy prediction models using textual data realistic and possible, since accounting and market data are available for all companies unlike textual data. The empirical results in this research show that the classification performance of our proposed methodology is superior compared to that of a large number of traditional classifier models. We also show that our proposed methodology solves the limitation of previous bankruptcy models using textual data, as they can only make predictions for a small proportion of companies.
    Fast Kernel Methods for Generic Lipschitz Losses via $p$-Sparsified Sketches. (arXiv:2206.03827v7 [stat.ML] UPDATED)
    Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists in looking for solutions among a subspace of reduced dimension, is a well studied approach to alleviate these computational burdens. However, statistically-accurate sketches, such as the Gaussian one, usually contain few null entries, such that their application to kernel methods and their non-sparse Gram matrices remains slow in practice. In this paper, we show that sparsified Gaussian (and Rademacher) sketches still produce theoretically-valid approximations while allowing for important time and space savings thanks to an efficient \emph{decomposition trick}. To support our method, we derive excess risk bounds for both single and multiple output kernel problems, with generic Lipschitz losses, hereby providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. Our theoretical results are complemented with experiments showing the empirical superiority of our approach over SOTA sketching methods.
    Estimation and inference for transfer learning with high-dimensional quantile regression. (arXiv:2211.14578v3 [stat.ML] UPDATED)
    Transfer learning has become an essential technique to exploit information from the source domain to boost performance of the target task. Despite the prevalence in high-dimensional data, heterogeneity and heavy tails are insufficiently accounted for by current transfer learning approaches and thus may undermine the resulting performance. We propose a transfer learning procedure in the framework of high-dimensional quantile regression models to accommodate heterogeneity and heavy tails in the source and target domains. We establish error bounds of transfer learning estimator based on delicately selected transferable source domains, showing that lower error bounds can be achieved for critical selection criterion and larger sample size of source tasks. We further propose valid confidence interval and hypothesis test procedures for individual component of high-dimensional quantile regression coefficients by advocating a double transfer learning estimator, which is one-step debiased estimator for the transfer learning estimator wherein the technique of transfer learning is designed again. By adopting data-splitting technique, we advocate a transferability detection approach that guarantees to circumvent negative transfer and identify transferable sources with high probability. Simulation results demonstrate that the proposed method exhibits some favorable and compelling performances and the practical utility is further illustrated by analyzing a real example.
    PRISM: Progressive Restoration for Scene Graph-based Image Manipulation. (arXiv:2311.02247v1 [cs.LG])
    Scene graphs have emerged as accurate descriptive priors for image generation and manipulation tasks, however, their complexity and diversity of the shapes and relations of objects in data make it challenging to incorporate them into the models and generate high-quality results. To address these challenges, we propose PRISM, a novel progressive multi-head image manipulation approach to improve the accuracy and quality of the manipulated regions in the scene. Our image manipulation framework is trained using an end-to-end denoising masked reconstruction proxy task, where the masked regions are progressively unmasked from the outer regions to the inner part. We take advantage of the outer part of the masked area as they have a direct correlation with the context of the scene. Moreover, our multi-head architecture simultaneously generates detailed object-specific regions in addition to the entire image to produce higher-quality images. Our model outperforms the state-of-the-art methods in the semantic image manipulation task on the CLEVR and Visual Genome datasets. Our results demonstrate the potential of our approach for enhancing the quality and precision of scene graph-based image manipulation.
    FragXsiteDTI: Revealing Responsible Segments in Drug-Target Interaction with Transformer-Driven Interpretation. (arXiv:2311.02326v1 [cs.LG])
    Drug-Target Interaction (DTI) prediction is vital for drug discovery, yet challenges persist in achieving model interpretability and optimizing performance. We propose a novel transformer-based model, FragXsiteDTI, that aims to address these challenges in DTI prediction. Notably, FragXsiteDTI is the first DTI model to simultaneously leverage drug molecule fragments and protein pockets. Our information-rich representations for both proteins and drugs offer a detailed perspective on their interaction. Inspired by the Perceiver IO framework, our model features a learnable latent array, initially interacting with protein binding site embeddings using cross-attention and later refined through self-attention and used as a query to the drug fragments in the drug's cross-attention transformer block. This learnable query array serves as a mediator and enables seamless information translation, preserving critical nuances in drug-protein interactions. Our computational results on three benchmarking datasets demonstrate the superior predictive power of our model over several state-of-the-art models. We also show the interpretability of our model in terms of the critical components of both target proteins and drug molecules within drug-target pairs.
    Error-bounded Approximate Time Series Joins Using Compact Dictionary Representations of Time Series. (arXiv:2112.12965v2 [cs.DB] UPDATED)
    The matrix profile is an effective data mining tool that provides similarity join functionality for time series data. Users of the matrix profile can either join a time series with itself using intra-similarity join (i.e., self-join) or join a time series with another time series using inter-similarity join. By invoking either or both types of joins, the matrix profile can help users discover both conserved and anomalous structures in the data. Since the introduction of the matrix profile five years ago, multiple efforts have been made to speed up the computation with approximate joins; however, the majority of these efforts only focus on self-joins. In this work, we show that it is possible to efficiently perform approximate inter-time series similarity joins with error bounded guarantees by creating a compact "dictionary" representation of time series. Using the dictionary representation instead of the original time series, we are able to improve the throughput of an anomaly mining system by at least 20X, with essentially no decrease in accuracy. As a side effect, the dictionaries also summarize the time series in a semantically meaningful way and can provide intuitive and actionable insights. We demonstrate the utility of our dictionary-based inter-time series similarity joins on domains as diverse as medicine and transportation.
    Generative Adversarial Networks to infer velocity components in rotating turbulent flows. (arXiv:2301.07541v2 [physics.flu-dyn] UPDATED)
    Inference problems for two-dimensional snapshots of rotating turbulent flows are studied. We perform a systematic quantitative benchmark of point-wise and statistical reconstruction capabilities of the linear Extended Proper Orthogonal Decomposition (EPOD) method, a non-linear Convolutional Neural Network (CNN) and a Generative Adversarial Network (GAN). We attack the important task of inferring one velocity component out of the measurement of a second one, and two cases are studied: (I) both components lay in the plane orthogonal to the rotation axis and (II) one of the two is parallel to the rotation axis. We show that EPOD method works well only for the former case where both components are strongly correlated, while CNN and GAN always outperform EPOD both concerning point-wise and statistical reconstructions. For case (II), when the input and output data are weakly correlated, all methods fail to reconstruct faithfully the point-wise information. In this case, only GAN is able to reconstruct the field in a statistical sense. The analysis is performed using both standard validation tools based on $L_2$ spatial distance between the prediction and the ground truth and more sophisticated multi-scale analysis using wavelet decomposition. Statistical validation is based on standard Jensen-Shannon divergence between the probability density functions, spectral properties and multi-scale flatness.
    Successive Model-Agnostic Meta-Learning for Few-Shot Fault Time Series Prognosis. (arXiv:2311.02300v1 [cs.LG])
    Meta learning is a promising technique for solving few-shot fault prediction problems, which have attracted the attention of many researchers in recent years. Existing meta-learning methods for time series prediction, which predominantly rely on random and similarity matching-based task partitioning, face three major limitations: (1) feature exploitation inefficiency; (2) suboptimal task data allocation; and (3) limited robustness with small samples. To overcome these limitations, we introduce a novel 'pseudo meta-task' partitioning scheme that treats a continuous time period of a time series as a meta-task, composed of multiple successive short time periods. Employing continuous time series as pseudo meta-tasks allows our method to extract more comprehensive features and relationships from the data, resulting in more accurate predictions. Moreover, we introduce a differential algorithm to enhance the robustness of our method across different datasets. Through extensive experiments on several fault and time series prediction datasets, we demonstrate that our approach substantially enhances prediction performance and generalization capability under both few-shot and general conditions.
    MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning. (arXiv:2311.02303v1 [cs.LG])
    Code LLMs have emerged as a specialized research field, with remarkable studies dedicated to enhancing model's coding capabilities through fine-tuning on pre-trained models. Previous fine-tuning approaches were typically tailored to specific downstream tasks or scenarios, which meant separate fine-tuning for each task, requiring extensive training resources and posing challenges in terms of deployment and maintenance. Furthermore, these approaches failed to leverage the inherent interconnectedness among different code-related tasks. To overcome these limitations, we present a multi-task fine-tuning framework, MFTcoder, that enables simultaneous and parallel fine-tuning on multiple tasks. By incorporating various loss functions, we effectively address common challenges in multi-task learning, such as data imbalance, varying difficulty levels, and inconsistent convergence speeds. Extensive experiments have conclusively demonstrated that our multi-task fine-tuning approach outperforms both individual fine-tuning on single tasks and fine-tuning on a mixed ensemble of tasks. Moreover, MFTcoder offers efficient training capabilities, including efficient data tokenization modes and PEFT fine-tuning, resulting in significantly improved speed compared to traditional fine-tuning methods. MFTcoder seamlessly integrates with several mainstream open-source LLMs, such as CodeLLama and Qwen. Leveraging the CodeLLama foundation, our MFTcoder fine-tuned model, \textsc{CodeFuse-CodeLLama-34B}, achieves an impressive pass@1 score of 74.4\% on the HumaneEval benchmark, surpassing GPT-4 performance (67\%, zero-shot). MFTCoder is open-sourced at \url{https://github.com/codefuse-ai/MFTCOder}
    Multi-scale Time-stepping of Partial Differential Equations with Transformers. (arXiv:2311.02225v1 [cs.LG])
    Developing fast surrogates for Partial Differential Equations (PDEs) will accelerate design and optimization in almost all scientific and engineering applications. Neural networks have been receiving ever-increasing attention and demonstrated remarkable success in computational modeling of PDEs, however; their prediction accuracy is not at the level of full deployment. In this work, we utilize the transformer architecture, the backbone of numerous state-of-the-art AI models, to learn the dynamics of physical systems as the mixing of spatial patterns learned by a convolutional autoencoder. Moreover, we incorporate the idea of multi-scale hierarchical time-stepping to increase the prediction speed and decrease accumulated error over time. Our model achieves similar or better results in predicting the time-evolution of Navier-Stokes equations compared to the powerful Fourier Neural Operator (FNO) and two transformer-based neural operators OFormer and Galerkin Transformer.
    RigLSTM: Recurrent Independent Grid LSTM for Generalizable Sequence Learning. (arXiv:2311.02123v1 [cs.LG])
    Sequential processes in real-world often carry a combination of simple subsystems that interact with each other in certain forms. Learning such a modular structure can often improve the robustness against environmental changes. In this paper, we propose recurrent independent Grid LSTM (RigLSTM), composed of a group of independent LSTM cells that cooperate with each other, for exploiting the underlying modular structure of the target task. Our model adopts cell selection, input feature selection, hidden state selection, and soft state updating to achieve a better generalization ability on the basis of the recent Grid LSTM for the tasks where some factors differ between training and evaluation. Specifically, at each time step, only a fraction of cells are activated, and the activated cells select relevant inputs and cells to communicate with. At the end of one time step, the hidden states of the activated cells are updated by considering the relevance between the inputs and the hidden states from the last and current time steps. Extensive experiments on diversified sequential modeling tasks are conducted to show the superior generalization ability when there exist changes in the testing environment. Source code is available at \url{https://github.com/ziyuwwang/rig-lstm}.
    Using General Value Functions to Learn Domain-Backed Inventory Management Policies. (arXiv:2311.02125v1 [cs.LG])
    We consider the inventory management problem, where the goal is to balance conflicting objectives such as availability and wastage of a large range of products in a store. We propose a reinforcement learning (RL) approach that utilises General Value Functions (GVFs) to derive domain-backed inventory replenishment policies. The inventory replenishment decisions are modelled as a sequential decision making problem, which is challenging due to uncertain demand and the existence of aggregate (cross-product) constraints. In existing literature, GVFs have primarily been used for auxiliary task learning. We use this capability to train GVFs on domain-critical characteristics such as prediction of stock-out probability and wastage quantity. Using this domain expertise for more effective exploration, we train an RL agent to compute the inventory replenishment quantities for a large range of products (up to 6000 in the reported experiments), which share aggregate constraints such as the total weight/volume per delivery. Additionally, we show that the GVF predictions can be used to provide additional domain-backed insights into the decisions proposed by the RL agent. Finally, since the environment dynamics are fully transferred, the trained GVFs can be used for faster adaptation to vastly different business objectives (for example, due to the start of a promotional period or due to deployment in a new customer environment).
    Bayesian Optimization of Function Networks with Partial Evaluations. (arXiv:2311.02146v1 [stat.ML])
    Bayesian optimization is a framework for optimizing functions that are costly or time-consuming to evaluate. Recent work has considered Bayesian optimization of function networks (BOFN), where the objective function is computed via a network of functions, each taking as input the output of previous nodes in the network and additional parameters. Exploiting this network structure has been shown to yield significant performance improvements. Existing BOFN algorithms for general-purpose networks are required to evaluate the full network at each iteration. However, many real-world applications allow evaluating nodes individually. To take advantage of this opportunity, we propose a novel knowledge gradient acquisition function for BOFN that chooses which node to evaluate as well as the inputs for that node in a cost-aware fashion. This approach can dramatically reduce query costs by allowing the evaluation of part of the network at a lower cost relative to evaluating the entire network. We provide an efficient approach to optimizing our acquisition function and show it outperforms existing BOFN methods and other benchmarks across several synthetic and real-world problems. Our acquisition function is the first to enable cost-aware optimization of a broad class of function networks.
    What Knowledge Gets Distilled in Knowledge Distillation?. (arXiv:2205.16004v3 [cs.CV] UPDATED)
    Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance for the task at hand. Over the years, there has a been a deluge of novel techniques and use cases of knowledge distillation. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understanding of the process. Specifically, what is the knowledge that gets distilled in knowledge distillation? In other words, in what ways does the student become similar to the teacher? Does it start to localize objects in the same way? Does it get fooled by the same adversarial samples? Does its data invariance properties become similar? Our work presents a comprehensive study to try to answer these questions. We show that existing methods can indeed indirectly distill these properties beyond improving task performance. We further study why knowledge distillation might work this way, and show that our findings have practical implications as well.
    A Robust Backpropagation-Free Framework for Images. (arXiv:2206.01820v2 [cs.NE] UPDATED)
    While current deep learning algorithms have been successful for a wide variety of artificial intelligence (AI) tasks, including those involving structured image data, they present deep neurophysiological conceptual issues due to their reliance on the gradients that are computed by backpropagation of errors (backprop). Gradients are required to obtain synaptic weight adjustments but require knowledge of feed-forward activities in order to conduct backward propagation, a biologically implausible process. This is known as the "weight transport problem". Therefore, in this work, we present a more biologically plausible approach towards solving the weight transport problem for image data. This approach, which we name the error kernel driven activation alignment (EKDAA) algorithm, accomplishes through the introduction of locally derived error transmission kernels and error maps. Like standard deep learning networks, EKDAA performs the standard forward process via weights and activation functions; however, its backward error computation involves adaptive error kernels that propagate local error signals through the network. The efficacy of EKDAA is demonstrated by performing visual-recognition tasks on the Fashion MNIST, CIFAR-10 and SVHN benchmarks, along with demonstrating its ability to extract visual features from natural color images. Furthermore, in order to demonstrate its non-reliance on gradient computations, results are presented for an EKDAA trained CNN that employs a non-differentiable activation function.
    Generative Artificial Intelligence in Healthcare: Ethical Considerations and Assessment Checklist. (arXiv:2311.02107v1 [cs.LG])
    The widespread use of ChatGPT and other emerging technology powered by generative artificial intelligence (AI) has drawn much attention to potential ethical issues, especially in high-stakes applications such as healthcare. However, less clear is how to resolve such issues beyond following guidelines and regulations that are still under discussion and development. On the other hand, other types of generative AI have been used to synthesize images and other types of data for research and practical purposes, which have resolved some ethical issues and exposed other ethical issues, but such technology is less often the focus of ongoing ethical discussions. Here we highlight gaps in current ethical discussions of generative AI via a systematic scoping review of relevant existing research in healthcare, and reduce the gaps by proposing an ethics checklist for comprehensive assessment and transparent documentation of ethical discussions in generative AI development. While the checklist can be readily integrated into the current peer review and publication system to enhance generative AI research, it may also be used in broader settings to disclose ethics-related considerations in generative AI-powered products (or real-life applications of such products) to help users establish reasonable trust in their capabilities.
    Using reinforcement learning to autonomously identify sources of error for agents in group missions. (arXiv:2107.09232v4 [cs.RO] UPDATED)
    When agents swarm to execute a mission, some of them frequently exhibit sudden failure, as observed from the command base. It is generally difficult to determine whether a failure is caused by actuators (hypothesis, $h_a$) or sensors (hypothesis, $h_s$) by solely relying on the communication between the command base and concerning agent. However, by instigating collusion between the agents, the cause of failure can be identified; in other words, we expect to detect corresponding displacements for $h_a$ but not for $h_s$. In this study, we considered the question as to whether artificial intelligence can autonomously generate an action plan $\boldsymbol{g}$ to pinpoint the cause as aforedescribed. Because the expected response to $\boldsymbol{g}$ generally depends upon the adopted hypothesis [let the difference be denoted by $D(\boldsymbol{g})$], a formulation that uses $D\left(\boldsymbol{g}\right)$ to pinpoint the cause can be made. Although a $\boldsymbol{g}^*$ that maximizes $D(\boldsymbol{g})$ would be a suitable action plan for this task, such an optimization is difficult to achieve using the conventional gradient method, as $D(\boldsymbol{g})$ becomes nonzero in rare events such as collisions with other agents, and most swarm actions $\boldsymbol{g}$ give $D(\boldsymbol{g})=0$. In other words, throughout almost the entire space of $\boldsymbol{g}$, $D(\boldsymbol{g})$ has zero gradient, and the gradient method is not applicable. To overcome this problem, we formulated an action plan using Q-table reinforcement learning. Surprisingly, the optimal action plan generated via reinforcement learning presented a human-like solution to pinpoint the problem by colliding other agents with the failed agent. Using this simple prototype, we demonstrated the potential of applying Q-table reinforcement learning methods to plan autonomous actions to pinpoint the causes of failure.
    Machine learning's own Industrial Revolution. (arXiv:2311.02278v1 [cs.LG])
    Machine learning is expected to enable the next Industrial Revolution. However, lacking standardized and automated assembly networks, ML faces significant challenges to meet ever-growing enterprise demands and empower broad industries. In the Perspective, we argue that ML needs to first complete its own Industrial Revolution, elaborate on how to best achieve its goals, and discuss new opportunities to enable rapid translation from ML's innovation frontier to mass production and utilization.
    Thermal Face Image Classification using Deep Learning Techniques. (arXiv:2311.02314v1 [cs.CV])
    Thermal images have various applications in security, medical and industrial domains. This paper proposes a practical deep-learning approach for thermal image classification. Accurate and efficient classification of thermal images poses a significant challenge across various fields due to the complex image content and the scarcity of annotated datasets. This work uses a convolutional neural network (CNN) architecture, specifically ResNet-50 and VGGNet-19, to extract features from thermal images. This work also applied Kalman filter on thermal input images for image denoising. The experimental results demonstrate the effectiveness of the proposed approach in terms of accuracy and efficiency.
    Joint Problems in Learning Multiple Dynamical Systems. (arXiv:2311.02181v1 [math.OC])
    Clustering of time series is a well-studied problem, with applications ranging from quantitative, personalized models of metabolism obtained from metabolite concentrations to state discrimination in quantum information theory. We consider a variant, where given a set of trajectories and a number of parts, we jointly partition the set of trajectories and learn linear dynamical system (LDS) models for each part, so as to minimize the maximum error across all the models. We present globally convergent methods and EM heuristics, accompanied by promising computational results.
    RDumb: A simple approach that questions our progress in continual test-time adaptation. (arXiv:2306.05401v2 [cs.LG] UPDATED)
    Test-Time Adaptation (TTA) allows to update pre-trained models to changing data distributions at deployment time. While early work tested these algorithms for individual fixed distribution shifts, recent work proposed and applied methods for continual adaptation over long timescales. To examine the reported progress in the field, we propose the Continually Changing Corruptions (CCC) benchmark to measure asymptotic performance of TTA techniques. We find that eventually all but one state-of-the-art methods collapse and perform worse than a non-adapting model, including models specifically proposed to be robust to performance collapse. In addition, we introduce a simple baseline, "RDumb", that periodically resets the model to its pretrained state. RDumb performs better or on par with the previously proposed state-of-the-art in all considered benchmarks. Our results show that previous TTA approaches are neither effective at regularizing adaptation to avoid collapse nor able to outperform a simplistic resetting strategy.
    Towards objective and systematic evaluation of bias in medical imaging AI. (arXiv:2311.02115v1 [cs.CV])
    Artificial intelligence (AI) models trained using medical images for clinical tasks often exhibit bias in the form of disparities in performance between subgroups. Since not all sources of biases in real-world medical imaging data are easily identifiable, it is challenging to comprehensively assess how those biases are encoded in models, and how capable bias mitigation methods are at ameliorating performance disparities. In this article, we introduce a novel analysis framework for systematically and objectively investigating the impact of biases in medical images on AI models. We developed and tested this framework for conducting controlled in silico trials to assess bias in medical imaging AI using a tool for generating synthetic magnetic resonance images with known disease effects and sources of bias. The feasibility is showcased by using three counterfactual bias scenarios to measure the impact of simulated bias effects on a convolutional neural network (CNN) classifier and the efficacy of three bias mitigation strategies. The analysis revealed that the simulated biases resulted in expected subgroup performance disparities when the CNN was trained on the synthetic datasets. Moreover, reweighing was identified as the most successful bias mitigation strategy for this setup, and we demonstrated how explainable AI methods can aid in investigating the manifestation of bias in the model using this framework. Developing fair AI models is a considerable challenge given that many and often unknown sources of biases can be present in medical imaging datasets. In this work, we present a novel methodology to objectively study the impact of biases and mitigation strategies on deep learning pipelines, which can support the development of clinical AI that is robust and responsible.
    Variational Autoencoders for Noise Reduction in Industrial LLRF Systems. (arXiv:2311.02096v1 [physics.acc-ph])
    Industrial particle accelerators inherently operate in much dirtier environments than typical research accelerators. This leads to an increase in noise both in the RF system and in other electronic systems. Combined with the fact that industrial accelerators are mass produced, there is less attention given to optimizing the performance of an individual system. As a result, industrial systems tend to under perform considering their hardware hardware capabilities. With the growing demand for accelerators for medical sterilization, food irradiation, cancer treatment, and imaging, improving the signal processing of these machines will increase the margin for the deployment of these systems. Our work is focusing on using machine learning techniques to reduce the noise of RF signals used for pulse-to-pulse feedback in industrial accelerators. We will review our algorithms, simulation results, and results working with measured data. We will then discuss next steps for deployment and testing on an industrial system.
    Relax: Composable Abstractions for End-to-End Dynamic Machine Learning. (arXiv:2311.02103v1 [cs.LG])
    Dynamic shape computations have become critical in modern machine learning workloads, especially in emerging large language models. The success of these models has driven demand for deploying them to a diverse set of backend environments. In this paper, we present Relax, a compiler abstraction for optimizing end-to-end dynamic machine learning workloads. Relax introduces first-class symbolic shape annotations to track dynamic shape computations globally across the program. It also introduces a cross-level abstraction that encapsulates computational graphs, loop-level tensor programs, and library calls in a single representation to enable cross-level optimizations. We build an end-to-end compilation framework using the proposed approach to optimize dynamic shape models. Experimental results on large language models show that Relax delivers performance competitive with state-of-the-art hand-optimized systems across platforms and enables deployment of emerging dynamic models to a broader set of environments, including mobile phones, embedded devices, and web browsers.
    Cooperative Network Learning for Large-Scale and Decentralized Graphs. (arXiv:2311.02117v1 [cs.LG])
    Graph research, the systematic study of interconnected data points represented as graphs, plays a vital role in capturing intricate relationships within networked systems. However, in the real world, as graphs scale up, concerns about data security among different data-owning agencies arise, hindering information sharing and, ultimately, the utilization of graph data. Therefore, establishing a mutual trust mechanism among graph agencies is crucial for unlocking the full potential of graphs. Here, we introduce a Cooperative Network Learning (CNL) framework to ensure secure graph computing for various graph tasks. Essentially, this CNL framework unifies the local and global perspectives of GNN computing with distributed data for an agency by virtually connecting all participating agencies as a global graph without a fixed central coordinator. Inter-agency computing is protected by various technologies inherent in our framework, including homomorphic encryption and secure transmission. Moreover, each agency has a fair right to design or employ various graph learning models from its local or global perspective. Thus, CNL can collaboratively train GNN models based on decentralized graphs inferred from local and global graphs. Experiments on contagion dynamics prediction and traditional graph tasks (i.e., node classification and link prediction) demonstrate that our CNL architecture outperforms state-of-the-art GNNs developed at individual sites, revealing that CNL can provide a reliable, fair, secure, privacy-preserving, and global perspective to build effective and personalized models for network applications. We hope this framework will address privacy concerns in graph-related research and integrate decentralized graph data structures to benefit the network research community in cooperation and innovation.
  • Open

    The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing. (arXiv:2302.01186v3 [cs.LG] UPDATED)
    We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparametrized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overprameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.
    Multilevel Diffusion: Infinite Dimensional Score-Based Diffusion Models for Image Generation. (arXiv:2303.04772v3 [cs.LG] UPDATED)
    Score-based diffusion models (SBDM) have recently emerged as state-of-the-art approaches for image generation. Existing SBDMs are typically formulated in a finite-dimensional setting, where images are considered as tensors of finite size. This paper develops SBDMs in the infinite-dimensional setting, that is, we model the training data as functions supported on a rectangular domain. Besides the quest for generating images at ever higher resolution, our primary motivation is to create a well-posed infinite-dimensional learning problem so that we can discretize it consistently on multiple resolution levels. We thereby intend to obtain diffusion models that generalize across different resolution levels and improve the efficiency of the training process. We demonstrate how to overcome two shortcomings of current SBDM approaches in the infinite-dimensional setting. First, we modify the forward process to ensure that the latent distribution is well-defined in the infinite-dimensional setting using the notion of trace class operators. We derive the reverse processes for finite approximations. Second, we illustrate that approximating the score function with an operator network is beneficial for multilevel training. After deriving the convergence of the discretization and the approximation of multilevel training, we implement an infinite-dimensional SBDM approach and show the first promising results on MNIST and Fashion-MNIST, underlining our developed theory.
    Online covariance estimation for stochastic gradient descent under Markovian sampling. (arXiv:2308.01481v2 [math.ST] UPDATED)
    We investigate the online overlapping batch-means covariance estimator for Stochastic Gradient Descent (SGD) under Markovian sampling. Convergence rates of order $O\big(\sqrt{d}\,n^{-1/8}(\log n)^{1/4}\big)$ and $O\big(\sqrt{d}\,n^{-1/8}\big)$ are established under state-dependent and state-independent Markovian sampling, respectively, where $d$ is the dimensionality and $n$ denotes observations or SGD iterations. These rates match the best-known convergence rate for independent and identically distributed (i.i.d) data. Our analysis overcomes significant challenges that arise due to Markovian sampling, leading to the introduction of additional error terms and complex dependencies between the blocks of the batch-means covariance estimator. Moreover, we establish the convergence rate for the first four moments of the $\ell_2$ norm of the error of SGD dynamics under state-dependent Markovian data, which holds potential interest as an independent result. Numerical illustrations provide confidence intervals for SGD in linear and logistic regression models under Markovian sampling. Additionally, our method is applied to the strategic classification with logistic regression, where adversaries adaptively modify features during training to affect target class classification.
    Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation. (arXiv:2310.18919v2 [cs.LG] UPDATED)
    Recent studies in reinforcement learning (RL) have made significant progress by leveraging function approximation to alleviate the sample complexity hurdle for better performance. Despite the success, existing provably efficient algorithms typically rely on the accessibility of immediate feedback upon taking actions. The failure to account for the impact of delay in observations can significantly degrade the performance of real-world systems due to the regret blow-up. In this work, we tackle the challenge of delayed feedback in RL with linear function approximation by employing posterior sampling, which has been shown to empirically outperform the popular UCB algorithms in a wide range of regimes. We first introduce Delayed-PSVI, an optimistic value-based algorithm that effectively explores the value function space via noise perturbation with posterior sampling. We provide the first analysis for posterior sampling algorithms with delayed feedback in RL and show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[\tau])$ worst-case regret in the presence of unknown stochastic delays. Here $E[\tau]$ is the expected delay. To further improve its computational efficiency and to expand its applicability in high-dimensional RL problems, we incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI, which maintains the same order-optimal regret guarantee with $\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to demonstrate the statistical and computational efficacy of our algorithms.
    Moment Matching Denoising Gibbs Sampling. (arXiv:2305.11650v2 [stat.ML] UPDATED)
    Energy-Based Models (EBMs) offer a versatile framework for modeling complex data distributions. However, training and sampling from EBMs continue to pose significant challenges. The widely-used Denoising Score Matching (DSM) method for scalable EBM training suffers from inconsistency issues, causing the energy model to learn a `noisy' data distribution. In this work, we propose an efficient sampling framework: (pseudo)-Gibbs sampling with moment matching, which enables effective sampling from the underlying clean model when given a `noisy' model that has been well-trained via DSM. We explore the benefits of our approach compared to related methods and demonstrate how to scale the method to high-dimensional datasets.
    Differentiable Cutting-plane Layers for Mixed-integer Linear Optimization. (arXiv:2311.03350v1 [math.OC])
    We consider the problem of solving a family of parametric mixed-integer linear optimization problems where some entries in the input data change. We introduce the concept of $cutting-plane$ $layer$ (CPL), $i.e.$, a differentiable cutting-plane generator mapping the problem data and previous iterates to cutting planes. We propose a CPL implementation to generate split cuts, and by combining several CPLs, we devise a differentiable cutting-plane algorithm that exploits the repeated nature of parametric instances. In an offline phase, we train our algorithm by updating the parameters controlling the CPLs, thus altering cut generation. Once trained, our algorithm computes, with predictable execution times and a fixed number of cuts, solutions with low integrality gaps. Preliminary computational tests show that our algorithm generalizes on unseen instances and captures underlying parametric structures.
    ProtoryNet - Interpretable Text Classification Via Prototype Trajectories. (arXiv:2007.01777v5 [cs.LG] UPDATED)
    We propose a novel interpretable deep neural network for text classification, called ProtoryNet, based on a new concept of prototype trajectories. Motivated by the prototype theory in modern linguistics, ProtoryNet makes a prediction by finding the most similar prototype for each sentence in a text sequence and feeding an RNN backbone with the proximity of each sentence to the corresponding active prototype. The RNN backbone then captures the temporal pattern of the prototypes, which we refer to as prototype trajectories. Prototype trajectories enable intuitive and fine-grained interpretation of the reasoning process of the RNN model, in resemblance to how humans analyze texts. We also design a prototype pruning procedure to reduce the total number of prototypes used by the model for better interpretability. Experiments on multiple public data sets show that ProtoryNet is more accurate than the baseline prototype-based deep neural net and reduces the performance gap compared to state-of-the-art black-box models. In addition, after prototype pruning, the resulting ProtoryNet models only need less than or around 20 prototypes for all datasets, which significantly benefits interpretability. Furthermore, we report a survey result indicating that human users find ProtoryNet more intuitive and easier to understand than other prototype-based methods.
    Is RLHF More Difficult than Standard RL?. (arXiv:2306.14111v2 [cs.LG] UPDATED)
    Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games with a restricted set of policies. The latter case can be further reduced to adversarial MDP when preferences only depend on the final state. We instantiate all reward-based RL subroutines by concrete provable algorithms, and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees when K-wise comparisons are available.
    Architecture Matters: Uncovering Implicit Mechanisms in Graph Contrastive Learning. (arXiv:2311.02687v1 [cs.LG])
    With the prosperity of contrastive learning for visual representation learning (VCL), it is also adapted to the graph domain and yields promising performance. However, through a systematic study of various graph contrastive learning (GCL) methods, we observe that some common phenomena among existing GCL methods that are quite different from the original VCL methods, including 1) positive samples are not a must for GCL; 2) negative samples are not necessary for graph classification, neither for node classification when adopting specific normalization modules; 3) data augmentations have much less influence on GCL, as simple domain-agnostic augmentations (e.g., Gaussian noise) can also attain fairly good performance. By uncovering how the implicit inductive bias of GNNs works in contrastive learning, we theoretically provide insights into the above intriguing properties of GCL. Rather than directly porting existing VCL methods to GCL, we advocate for more attention toward the unique architecture of graph learning and consider its implicit influence when designing GCL methods. Code is available at https: //github.com/PKU-ML/ArchitectureMattersGCL.
    Parameter-Agnostic Optimization under Relaxed Smoothness. (arXiv:2311.03252v1 [math.OC])
    Tuning hyperparameters, such as the stepsize, presents a major challenge of training machine learning models. To address this challenge, numerous adaptive optimization algorithms have been developed that achieve near-optimal complexities, even when stepsizes are independent of problem-specific parameters, provided that the loss function is $L$-smooth. However, as the assumption is relaxed to the more realistic $(L_0, L_1)$-smoothness, all existing convergence results still necessitate tuning of the stepsize. In this study, we demonstrate that Normalized Stochastic Gradient Descent with Momentum (NSGD-M) can achieve a (nearly) rate-optimal complexity without prior knowledge of any problem parameter, though this comes at the cost of introducing an exponential term dependent on $L_1$ in the complexity. We further establish that this exponential term is inevitable to such schemes by introducing a theoretical framework of lower bounds tailored explicitly for parameter-agnostic algorithms. Interestingly, in deterministic settings, the exponential factor can be neutralized by employing Gradient Descent with a Backtracking Line Search. To the best of our knowledge, these findings represent the first parameter-agnostic convergence results under the generalized smoothness condition. Our empirical experiments further confirm our theoretical insights.
    Sampling via Gradient Flows in the Space of Probability Measures. (arXiv:2310.03597v2 [stat.ML] UPDATED)
    Sampling a target probability distribution with an unknown normalization constant is a fundamental challenge in computational science and engineering. Recent work shows that algorithms derived by considering gradient flows in the space of probability measures open up new avenues for algorithm development. This paper makes three contributions to this sampling approach by scrutinizing the design components of such gradient flows. Any instantiation of a gradient flow for sampling needs an energy functional and a metric to determine the flow, as well as numerical approximations of the flow to derive algorithms. Our first contribution is to show that the Kullback-Leibler divergence, as an energy functional, has the unique property (among all f-divergences) that gradient flows resulting from it do not depend on the normalization constant of the target distribution. Our second contribution is to study the choice of metric from the perspective of invariance. The Fisher-Rao metric is known as the unique choice (up to scaling) that is diffeomorphism invariant. As a computationally tractable alternative, we introduce a relaxed, affine invariance property for the metrics and gradient flows. In particular, we construct various affine invariant Wasserstein and Stein gradient flows. Affine invariant gradient flows are shown to behave more favorably than their non-affine-invariant counterparts when sampling highly anisotropic distributions, in theory and by using particle methods. Our third contribution is to study, and develop efficient algorithms based on Gaussian approximations of the gradient flows; this leads to an alternative to particle methods. We establish connections between various Gaussian approximate gradient flows, discuss their relation to gradient methods arising from parametric variational inference, and study their convergence properties both theoretically and numerically.
    Uncertainty Quantification via Neural Posterior Principal Components. (arXiv:2309.15533v2 [cs.CV] UPDATED)
    Uncertainty quantification is crucial for the deployment of image restoration models in safety-critical domains, like autonomous driving and biological imaging. To date, methods for uncertainty visualization have mainly focused on per-pixel estimates. Yet, a heatmap of per-pixel variances is typically of little practical use, as it does not capture the strong correlations between pixels. A more natural measure of uncertainty corresponds to the variances along the principal components (PCs) of the posterior distribution. Theoretically, the PCs can be computed by applying PCA on samples generated from a conditional generative model for the input image. However, this requires generating a very large number of samples at test time, which is painfully slow with the current state-of-the-art (diffusion) models. In this work, we present a method for predicting the PCs of the posterior distribution for any input image, in a single forward pass of a neural network. Our method can either wrap around a pre-trained model that was trained to minimize the mean square error (MSE), or can be trained from scratch to output both a predicted image and the posterior PCs. We showcase our method on multiple inverse problems in imaging, including denoising, inpainting, super-resolution, and biological image-to-image translation. Our method reliably conveys instance-adaptive uncertainty directions, achieving uncertainty quantification comparable with posterior samplers while being orders of magnitude faster. Code and examples are available at https://eliasnehme.github.io/NPPC/
    Modelling Cellular Perturbations with the Sparse Additive Mechanism Shift Variational Autoencoder. (arXiv:2311.02794v1 [stat.ML])
    Generative models of observations under interventions have been a vibrant topic of interest across machine learning and the sciences in recent years. For example, in drug discovery, there is a need to model the effects of diverse interventions on cells in order to characterize unknown biological mechanisms of action. We propose the Sparse Additive Mechanism Shift Variational Autoencoder, SAMS-VAE, to combine compositionality, disentanglement, and interpretability for perturbation models. SAMS-VAE models the latent state of a perturbed sample as the sum of a local latent variable capturing sample-specific variation and sparse global variables of latent intervention effects. Crucially, SAMS-VAE sparsifies these global latent variables for individual perturbations to identify disentangled, perturbation-specific latent subspaces that are flexibly composable. We evaluate SAMS-VAE both quantitatively and qualitatively on a range of tasks using two popular single cell sequencing datasets. In order to measure perturbation-specific model-properties, we also introduce a framework for evaluation of perturbation models based on average treatment effects with links to posterior predictive checks. SAMS-VAE outperforms comparable models in terms of generalization across in-distribution and out-of-distribution tasks, including a combinatorial reasoning task under resource paucity, and yields interpretable latent structures which correlate strongly to known biological mechanisms. Our results suggest SAMS-VAE is an interesting addition to the modeling toolkit for machine learning-driven scientific discovery.
    Efficient Robust Bayesian Optimization for Arbitrary Uncertain Inputs. (arXiv:2310.20145v2 [cs.LG] UPDATED)
    Bayesian Optimization (BO) is a sample-efficient optimization algorithm widely employed across various applications. In some challenging BO tasks, input uncertainty arises due to the inevitable randomness in the optimization process, such as machining errors, execution noise, or contextual variability. This uncertainty deviates the input from the intended value before evaluation, resulting in significant performance fluctuations in the final result. In this paper, we introduce a novel robust Bayesian Optimization algorithm, AIRBO, which can effectively identify a robust optimum that performs consistently well under arbitrary input uncertainty. Our method directly models the uncertain inputs of arbitrary distributions by empowering the Gaussian Process with the Maximum Mean Discrepancy (MMD) and further accelerates the posterior inference via Nystrom approximation. Rigorous theoretical regret bound is established under MMD estimation error and extensive experiments on synthetic functions and real problems demonstrate that our approach can handle various input uncertainties and achieve state-of-the-art performance.
    Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. (arXiv:2210.07893v3 [stat.ML] UPDATED)
    Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks.
    Practical Equivariances via Relational Conditional Neural Processes. (arXiv:2306.10915v2 [stat.ML] UPDATED)
    Conditional Neural Processes (CNPs) are a class of metalearning models popular for combining the runtime efficiency of amortized inference with reliable uncertainty quantification. Many relevant machine learning tasks, such as in spatio-temporal modeling, Bayesian Optimization and continuous control, inherently contain equivariances -- for example to translation -- which the model can exploit for maximal performance. However, prior attempts to include equivariances in CNPs do not scale effectively beyond two input dimensions. In this work, we propose Relational Conditional Neural Processes (RCNPs), an effective approach to incorporate equivariances into any neural process model. Our proposed method extends the applicability and impact of equivariant neural processes to higher dimensions. We empirically demonstrate the competitive performance of RCNPs on a large array of tasks naturally containing equivariances.
    Exact Generalization Guarantees for (Regularized) Wasserstein Distributionally Robust Models. (arXiv:2305.17076v2 [cs.LG] UPDATED)
    Wasserstein distributionally robust estimators have emerged as powerful models for prediction and decision-making under uncertainty. These estimators provide attractive generalization guarantees: the robust objective obtained from the training distribution is an exact upper bound on the true risk with high probability. However, existing guarantees either suffer from the curse of dimensionality, are restricted to specific settings, or lead to spurious error terms. In this paper, we show that these generalization guarantees actually hold on general classes of models, do not suffer from the curse of dimensionality, and can even cover distribution shifts at testing. We also prove that these results carry over to the newly-introduced regularized versions of Wasserstein distributionally robust problems.
    PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference. (arXiv:2309.02334v2 [cs.LG] UPDATED)
    Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by the idea that the LUTs in an FPGA can be used to implement a much greater variety of functions than this. In this paper, we propose a novel approach to training neural networks for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. We show that by using polynomial building blocks, we can achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.
    Differentiable Clustering with Perturbed Spanning Forests. (arXiv:2305.16358v3 [cs.LG] UPDATED)
    We introduce a differentiable clustering method based on stochastic perturbations of minimum-weight spanning forests. This allows us to include clustering in end-to-end trainable pipelines, with efficient gradients. We show that our method performs well even in difficult settings, such as data sets with high noise and challenging geometries. We also formulate an ad hoc loss to efficiently learn from partial clustering data using this operation. We demonstrate its performance on several data sets for supervised and semi-supervised tasks.
    Fine-Tune Language Models as Differential Equation Solvers. (arXiv:2308.05061v2 [cs.LG] UPDATED)
    In the growing domain of scientific machine learning, in-context operator learning has shown notable potential in learning operators and solving differential equations using prompted data, during the inference stage without weight updates. However, the current model's overdependence on function data, may inadvertently overlook the invaluable human insight into the operator. To address this, we present a transformation of in-context operator learning into a multi-modal paradigm. In particular, we take inspiration from the recent success of large language models, and propose using "captions" to integrate human knowledge about the operator, expressed through natural language descriptions and equations. Also, we introduce a novel approach to train a language-model-like architecture, or directly fine-tune existing language models, for in-context operator learning. We beat the baseline on single-modal learning tasks, and also demonstrated the effectiveness of multi-modal learning in enhancing performance and reducing function data requirements. The proposed method not only significantly improves in-context operator learning, but also creates a new path for the application of language models.
    Data pruning and neural scaling laws: fundamental limitations of score-based algorithms. (arXiv:2302.06960v3 [stat.ML] UPDATED)
    Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of $30\%$ or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving the so-called neural scaling laws; in [Sorscher et al.], the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate ``No Free Lunch" theorems for data pruning and present calibration protocols that enhance the performance of existing pruning algorithms in this high compression regime using randomization.
    Finding Counterfactually Optimal Action Sequences in Continuous State Spaces. (arXiv:2306.03929v2 [cs.LG] UPDATED)
    Whenever a clinician reflects on the efficacy of a sequence of treatment decisions for a patient, they may try to identify critical time steps where, had they made different decisions, the patient's health would have improved. While recent methods at the intersection of causal inference and reinforcement learning promise to aid human experts, as the clinician above, to retrospectively analyze sequential decision making processes, they have focused on environments with finitely many discrete states. However, in many practical applications, the state of the environment is inherently continuous in nature. In this paper, we aim to fill this gap. We start by formally characterizing a sequence of discrete actions and continuous states using finite horizon Markov decision processes and a broad class of bijective structural causal models. Building upon this characterization, we formalize the problem of finding counterfactually optimal action sequences and show that, in general, we cannot expect to solve it in polynomial time. Then, we develop a search method based on the $A^*$ algorithm that, under a natural form of Lipschitz continuity of the environment's dynamics, is guaranteed to return the optimal solution to the problem. Experiments on real clinical data show that our method is very efficient in practice, and it has the potential to offer interesting insights for sequential decision making tasks.
    Bridging RL Theory and Practice with the Effective Horizon. (arXiv:2304.09853v2 [cs.LG] UPDATED)
    Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We choose to focus on deterministic environments because they share many interesting properties of stochastic environments, but are easier to analyze. Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e. when it is optimal to be greedy on the random policy's Q function), deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon
    On existence, uniqueness and scalability of adversarial robustness measures for AI classifiers. (arXiv:2310.14421v2 [stat.ML] UPDATED)
    Simply-verifiable mathematical conditions for existence, uniqueness and explicit analytical computation of minimal adversarial paths (MAP) and minimal adversarial distances (MAD) for (locally) uniquely-invertible classifiers, for generalized linear models (GLM), and for entropic AI (EAI) are formulated and proven. Practical computation of MAP and MAD, their comparison and interpretations for various classes of AI tools (for neuronal networks, boosted random forests, GLM and EAI) are demonstrated on the common synthetic benchmarks: on a double Swiss roll spiral and its extensions, as well as on the two biomedical data problems (for the health insurance claim predictions, and for the heart attack lethality classification). On biomedical applications it is demonstrated how MAP provides unique minimal patient-specific risk-mitigating interventions in the predefined subsets of accessible control variables.
    Almost Equivariance via Lie Algebra Convolutions. (arXiv:2310.13164v1 [cs.LG] CROSS LISTED)
    Recently, the equivariance of models with respect to a group action has become an important topic of research in machine learning. However, imbuing an architecture with a specific group equivariance imposes a strong prior on the types of data transformations that the model expects to see. While strictly-equivariant models enforce symmetries, real-world data does not always conform to such strict equivariances, be it due to noise in the data or underlying physical laws that encode only approximate or partial symmetries. In such cases, the prior of strict equivariance can actually prove too strong and cause models to underperform on real-world data. Therefore, in this work we study a closely related topic, that of almost equivariance. We provide a definition of almost equivariance that differs from those extant in the current literature and give a practical method for encoding almost equivariance in models by appealing to the Lie algebra of a Lie group. Specifically, we define Lie algebra convolutions and demonstrate that they offer several benefits over Lie group convolutions, including being well-defined for non-compact groups. From there, we pivot to the realm of theory and demonstrate connections between the notions of equivariance and isometry and those of almost equivariance and almost isometry, respectively. We prove two existence theorems, one showing the existence of almost isometries within bounded distance of isometries of a general manifold, and another showing the converse for Hilbert spaces. We then extend these theorems to prove the existence of almost equivariant manifold embeddings within bounded distance of fully equivariant embedding functions, subject to certain constraints on the group action and the function class. Finally, we demonstrate the validity of our approach by benchmarking against datasets in fully equivariant and almost equivariant settings.
    Estimation and inference for transfer learning with high-dimensional quantile regression. (arXiv:2211.14578v3 [stat.ML] UPDATED)
    Transfer learning has become an essential technique to exploit information from the source domain to boost performance of the target task. Despite the prevalence in high-dimensional data, heterogeneity and heavy tails are insufficiently accounted for by current transfer learning approaches and thus may undermine the resulting performance. We propose a transfer learning procedure in the framework of high-dimensional quantile regression models to accommodate heterogeneity and heavy tails in the source and target domains. We establish error bounds of transfer learning estimator based on delicately selected transferable source domains, showing that lower error bounds can be achieved for critical selection criterion and larger sample size of source tasks. We further propose valid confidence interval and hypothesis test procedures for individual component of high-dimensional quantile regression coefficients by advocating a double transfer learning estimator, which is one-step debiased estimator for the transfer learning estimator wherein the technique of transfer learning is designed again. By adopting data-splitting technique, we advocate a transferability detection approach that guarantees to circumvent negative transfer and identify transferable sources with high probability. Simulation results demonstrate that the proposed method exhibits some favorable and compelling performances and the practical utility is further illustrated by analyzing a real example.
    Detecting hidden confounding in observational data using multiple environments. (arXiv:2205.13935v4 [stat.ME] UPDATED)
    A common assumption in causal inference from observational data is that there is no hidden confounding. Yet it is, in general, impossible to verify this assumption from a single dataset. Under the assumption of independent causal mechanisms underlying the data-generating process, we demonstrate a way to detect unobserved confounders when having multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only absent when there is hidden confounding and examine cases where we violate its assumptions: degenerate & dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies and semi-synthetic data based on a real-world dataset. In most cases, the proposed procedure correctly predicts the presence of hidden confounding, particularly when the confounding bias is large.
    For SALE: State-Action Representation Learning for Deep Reinforcement Learning. (arXiv:2306.02451v2 [cs.LG] UPDATED)
    In the field of reinforcement learning (RL), representation learning is a proven tool for complex image-based tasks, but is often overlooked for environments with low-level states, such as physical control problems. This paper introduces SALE, a novel approach for learning embeddings that model the nuanced interaction between state and action, enabling effective representation learning from low-level states. We extensively study the design space of these embeddings and highlight important design considerations. We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm, which significantly outperforms existing continuous control algorithms. On OpenAI gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively, and works in both the online and offline settings.
    Learning Hard-Constrained Models with One Sample. (arXiv:2311.03332v1 [cs.LG])
    We consider the problem of estimating the parameters of a Markov Random Field with hard-constraints using a single sample. As our main running examples, we use the $k$-SAT and the proper coloring models, as well as general $H$-coloring models; for all of these we obtain both positive and negative results. In contrast to the soft-constrained case, we show in particular that single-sample estimation is not always possible, and that the existence of an estimator is related to the existence of non-satisfiable instances. Our algorithms are based on the pseudo-likelihood estimator. We show variance bounds for this estimator using coupling techniques inspired, in the case of $k$-SAT, by Moitra's sampling algorithm (JACM, 2019); our positive results for colorings build on this new coupling approach. For $q$-colorings on graphs with maximum degree $d$, we give a linear-time estimator when $q>d+1$, whereas the problem is non-identifiable when $q\leq d+1$. For general $H$-colorings, we show that standard conditions that guarantee sampling, such as Dobrushin's condition, are insufficient for one-sample learning; on the positive side, we provide a general condition that is sufficient to guarantee linear-time learning and obtain applications for proper colorings and permissive models. For the $k$-SAT model on formulas with maximum degree $d$, we provide a linear-time estimator when $k\gtrsim 6.45\log d$, whereas the problem becomes non-identifiable when $k\lesssim \log d$.
    Are you using test log-likelihood correctly?. (arXiv:2212.00219v3 [stat.ML] UPDATED)
    Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.
    Benign Overfitting for Two-layer ReLU Convolutional Neural Networks. (arXiv:2303.04145v2 [cs.LG] UPDATED)
    Modern deep learning models with great expressive power can be trained to overfit the training data but still generalize well. This phenomenon is referred to as \textit{benign overfitting}. Recently, a few studies have attempted to theoretically understand benign overfitting in neural networks. However, these works are either limited to neural networks with smooth activation functions or to the neural tangent kernel regime. How and when benign overfitting can occur in ReLU neural networks remains an open problem. In this work, we seek to answer this question by establishing algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise. We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk. Our result also reveals a sharp transition between benign and harmful overfitting under different conditions on data distribution in terms of test risk. Experiments on synthetic data back up our theory.
    Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes. (arXiv:2212.06132v3 [cs.LG] UPDATED)
    We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs) whose transition probability can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with a carefully designed weight, which depends on a new variance estimator that (1) directly estimates the variance of the optimal value function, (2) monotonically decreases with respect to the number of episodes to ensure a better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator to control the complexity of the estimated value function class. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
    Fast Kernel Methods for Generic Lipschitz Losses via $p$-Sparsified Sketches. (arXiv:2206.03827v7 [stat.ML] UPDATED)
    Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists in looking for solutions among a subspace of reduced dimension, is a well studied approach to alleviate these computational burdens. However, statistically-accurate sketches, such as the Gaussian one, usually contain few null entries, such that their application to kernel methods and their non-sparse Gram matrices remains slow in practice. In this paper, we show that sparsified Gaussian (and Rademacher) sketches still produce theoretically-valid approximations while allowing for important time and space savings thanks to an efficient \emph{decomposition trick}. To support our method, we derive excess risk bounds for both single and multiple output kernel problems, with generic Lipschitz losses, hereby providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. Our theoretical results are complemented with experiments showing the empirical superiority of our approach over SOTA sketching methods.
    A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence. (arXiv:2301.13139v3 [stat.ML] UPDATED)
    Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample complexity when using shallow neural networks, show that it represents an improvement upon the previous best results, and empirically validate the effectiveness of our theoretical claims on classic control tasks.
    Transfer-Learning Across Datasets with Different Input Dimensions: An Algorithm and Analysis for the Linear Regression Case. (arXiv:2202.05069v4 [stat.ML] UPDATED)
    With the development of new sensors and monitoring devices, more sources of data become available to be used as inputs for machine learning models. These can on the one hand help to improve the accuracy of a model. On the other hand, combining these new inputs with historical data remains a challenge that has not yet been studied in enough detail. In this work, we propose a transfer learning algorithm that combines new and historical data with different input dimensions. This approach is easy to implement, efficient, with computational complexity equivalent to the ordinary least-squares method, and requires no hyperparameter tuning, making it straightforward to apply when the new data is limited. Different from other approaches, we provide a rigorous theoretical study of its robustness, showing that it cannot be outperformed by a baseline that utilizes only the new data. Our approach achieves state-of-the-art performance on 9 real-life datasets, outperforming the linear DSFT, another linear transfer learning algorithm, and performing comparably to non-linear DSFT.
    DiffLoad: Uncertainty Quantification in Load Forecasting with Diffusion Model. (arXiv:2306.01001v2 [cs.LG] UPDATED)
    Electrical load forecasting plays a crucial role in decision-making for power systems, including unit commitment and economic dispatch. The integration of renewable energy sources and the occurrence of external events, such as the COVID-19 pandemic, have rapidly increased uncertainties in load forecasting. The uncertainties in load forecasting can be divided into two types: epistemic uncertainty and aleatoric uncertainty. Separating these types of uncertainties can help decision-makers better understand where and to what extent the uncertainty is, thereby enhancing their confidence in the following decision-making. This paper proposes a diffusion-based Seq2Seq structure to estimate epistemic uncertainty and employs the robust additive Cauchy distribution to estimate aleatoric uncertainty. Our method not only ensures the accuracy of load forecasting but also demonstrates the ability to separate the two types of uncertainties and be applicable to different levels of loads. The relevant code can be found at \url{https://anonymous.4open.science/r/DiffLoad-4714/}.
    Flooding with Absorption: An Efficient Protocol for Heterogeneous Bandits over Complex Networks. (arXiv:2303.05445v3 [cs.LG] UPDATED)
    Multi-armed bandits are extensively used to model sequential decision-making, making them ubiquitous in many real-life applications such as online recommender systems and wireless networking. We consider a multi-agent setting where each agent solves their own bandit instance endowed with a different set of arms. Their goal is to minimize their group regret while collaborating via some communication protocol over a given network. Previous literature on this problem only considered arm heterogeneity and networked agents separately. In this work, we introduce a setting that encompasses both features. For this novel setting, we first provide a rigorous regret analysis for a standard flooding protocol combined with the classic UCB policy. Then, to mitigate the issue of high communication costs incurred by flooding in complex networks, we propose a new protocol called Flooding with Absorption (FwA). We provide a theoretical analysis of the resulting regret bound and discuss the advantages of using FwA over flooding. Lastly, we experimentally verify on various scenarios, including dynamic networks, that FwA leads to significantly lower communication costs despite minimal regret performance loss compared to other network protocols.
    Identifying Linearly-Mixed Causal Representations from Multi-Node Interventions. (arXiv:2311.02695v1 [stat.ML])
    The task of inferring high-level causal variables from low-level observations, commonly referred to as causal representation learning, is fundamentally underconstrained. As such, recent works to address this problem focus on various assumptions that lead to identifiability of the underlying latent causal variables. A large corpus of these preceding approaches consider multi-environment data collected under different interventions on the causal model. What is common to virtually all of these works is the restrictive assumption that in each environment, only a single variable is intervened on. In this work, we relax this assumption and provide the first identifiability result for causal representation learning that allows for multiple variables to be targeted by an intervention within one environment. Our approach hinges on a general assumption on the coverage and diversity of interventions across environments, which also includes the shared assumption of single-node interventions of previous works. The main idea behind our approach is to exploit the trace that interventions leave on the variance of the ground truth causal variables and regularizing for a specific notion of sparsity with respect to this trace. In addition to and inspired by our theoretical contributions, we present a practical algorithm to learn causal representations from multi-node interventional data and provide empirical evidence that validates our identifiability results.
    Deep Learning with Kernels through RKHM and the Perron-Frobenius Operator. (arXiv:2305.13588v2 [stat.ML] UPDATED)
    Reproducing kernel Hilbert $C^*$-module (RKHM) is a generalization of reproducing kernel Hilbert space (RKHS) by means of $C^*$-algebra, and the Perron-Frobenius operator is a linear operator related to the composition of functions. Combining these two concepts, we present deep RKHM, a deep learning framework for kernel methods. We derive a new Rademacher generalization bound in this setting and provide a theoretical interpretation of benign overfitting by means of Perron-Frobenius operators. By virtue of $C^*$-algebra, the dependency of the bound on output dimension is milder than existing bounds. We show that $C^*$-algebra is a suitable tool for deep learning with kernels, enabling us to take advantage of the product structure of operators and to provide a clear connection with convolutional neural networks. Our theoretical analysis provides a new lens through which one can design and analyze deep kernel methods.
    Using multimodal learning and deep generative models for corporate bankruptcy prediction. (arXiv:2211.08405v4 [q-fin.RM] UPDATED)
    Textual data from financial filings, e.g., the Management's Discussion \& Analysis (MDA) section in Form 10-K, has been used to improve the prediction accuracy of bankruptcy models. In practice, however, we cannot obtain the MDA section for all public companies. The two main reasons for the lack of MDA are: (i) not all companies are obliged to submit the MDA and (ii) technical problems arise when crawling and scrapping the MDA section. This research introduces for the first time, to the best of our knowledge, the concept of multimodal learning in bankruptcy prediction models to solve the problem that for some companies we are unable to obtain the MDA text. We use the Conditional Multimodal Discriminative (CMMD) model to learn multimodal representations that embed information from accounting, market, and textual modalities. The CMMD model needs a sample with all data modalities for model training. At test time, the CMMD model only needs access to accounting and market modalities to generate multimodal representations, which are further used to make bankruptcy predictions. This fact makes the use of bankruptcy prediction models using textual data realistic and possible, since accounting and market data are available for all companies unlike textual data. The empirical results in this research show that the classification performance of our proposed methodology is superior compared to that of a large number of traditional classifier models. We also show that our proposed methodology solves the limitation of previous bankruptcy models using textual data, as they can only make predictions for a small proportion of companies.
    Robust Meta-Representation Learning via Global Label Inference and Classification. (arXiv:2212.11702v2 [cs.LG] UPDATED)
    Few-shot learning (FSL) is a central problem in meta-learning, where learners must efficiently learn from few labeled examples. Within FSL, feature pre-training has recently become an increasingly popular strategy to significantly improve generalization performance. However, the contribution of pre-training is often overlooked and understudied, with limited theoretical understanding of its impact on meta-learning performance. Further, pre-training requires a consistent set of global labels shared across training tasks, which may be unavailable in practice. In this work, we address the above issues by first showing the connection between pre-training and meta-learning. We discuss why pre-training yields more robust meta-representation and connect the theoretical analysis to existing works and empirical results. Secondly, we introduce Meta Label Learning (MeLa), a novel meta-learning algorithm that learns task relations by inferring global labels across tasks. This allows us to exploit pre-training for FSL even when global labels are unavailable or ill-defined. Lastly, we introduce an augmented pre-training procedure that further improves the learned meta-representation. Empirically, MeLa outperforms existing methods across a diverse range of benchmarks, in particular under a more challenging setting where the number of training tasks is limited and labels are task-specific. We also provide extensive ablation study to highlight its key properties.
    Training Matters: Unlocking Potentials of Deeper Graph Convolutional Neural Networks. (arXiv:2008.08838v3 [cs.LG] UPDATED)
    The performance limit of Graph Convolutional Networks (GCNs) and the fact that we cannot stack more of them to increase the performance, which we usually do for other deep learning paradigms, are pervasively thought to be caused by the limitations of the GCN layers, including insufficient expressive power, etc. However, if so, for a fixed architecture, it would be unlikely to lower the training difficulty and to improve performance by changing only the training procedure, which we show in this paper not only possible but possible in several ways. This paper first identify the training difficulty of GCNs from the perspective of graph signal energy loss. More specifically, we find that the loss of energy in the backward pass during training nullifies the learning of the layers closer to the input. Then, we propose several methodologies to mitigate the training problem by slightly modifying the GCN operator, from the energy perspective. After empirical validation, we confirm that these changes of operator lead to significant decrease in the training difficulties and notable performance boost, without changing the composition of parameters. With these, we conclude that the root cause of the problem is more likely the training difficulty than the others.
    A Contrastive Approach to Online Change Point Detection. (arXiv:2206.10143v3 [stat.ML] UPDATED)
    We suggest a novel procedure for online change point detection. Our approach expands an idea of maximizing a discrepancy measure between points from pre-change and post-change distributions. This leads to a flexible procedure suitable for both parametric and nonparametric scenarios. We prove non-asymptotic bounds on the average running length of the procedure and its expected detection delay. The efficiency of the algorithm is illustrated with numerical experiments on synthetic and real-world data sets.
    A New Bandit Setting Balancing Information from State Evolution and Corrupted Context. (arXiv:2011.07989v4 [cs.LG] UPDATED)
    We propose a new sequential decision-making setting, combining key aspects of two established online learning problems with bandit feedback. The optimal action to play at any given moment is contingent on an underlying changing state which is not directly observable by the agent. Each state is associated with a context distribution, possibly corrupted, allowing the agent to identify the state. Furthermore, states evolve in a Markovian fashion, providing useful information to estimate the current state via state history. In the proposed problem setting, we tackle the challenge of deciding on which of the two sources of information the agent should base its arm selection. We present an algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit. We capture the time-correlation of states through iteratively learning the action-reward transition model, allowing for efficient exploration of actions. Our setting is motivated by adaptive mobile health (mHealth) interventions. Users transition through different, time-correlated, but only partially observable internal states, determining their current needs. The side information associated with each internal state might not always be reliable, and standard approaches solely rely on the context risk of incurring high regret. Similarly, some users might exhibit weaker correlations between subsequent states, leading to approaches that solely rely on state transitions risking the same. We analyze our setting and algorithm in terms of regret lower bound and upper bounds and evaluate our method on simulated medication adherence intervention data and several real-world data sets, showing improved empirical performance compared to several popular algorithms.
    Independent finite approximations for Bayesian nonparametric inference. (arXiv:2009.10780v4 [stat.ME] UPDATED)
    Completely random measures (CRMs) and their normalizations (NCRMs) offer flexible models in Bayesian nonparametrics. But their infinite dimensionality presents challenges for inference. Two popular finite approximations are truncated finite approximations (TFAs) and independent finite approximations (IFAs). While the former have been well-studied, IFAs lack similarly general bounds on approximation error, and there has been no systematic comparison between the two options. In the present work, we propose a general recipe to construct practical finite-dimensional approximations for homogeneous CRMs and NCRMs, in the presence or absence of power laws. We call our construction the automated independent finite approximation (AIFA). Relative to TFAs, we show that AIFAs facilitate more straightforward derivations and use of parallel computing in approximate inference. We upper bound the approximation error of AIFAs for a wide class of common CRMs and NCRMs -- and thereby develop guidelines for choosing the approximation level. Our lower bounds in key cases suggest that our upper bounds are tight. We prove that, for worst-case choices of observation likelihoods, TFAs are more efficient than AIFAs. Conversely, we find that in real-data experiments with standard likelihoods, AIFAs and TFAs perform similarly. Moreover, we demonstrate that AIFAs can be used for hyperparameter estimation even when other potential IFA options struggle or do not apply.
    Forward $\chi^2$ Divergence Based Variational Importance Sampling. (arXiv:2311.02516v1 [cs.LG])
    Maximizing the log-likelihood is a crucial aspect of learning latent variable models, and variational inference (VI) stands as the commonly adopted method. However, VI can encounter challenges in achieving a high log-likelihood when dealing with complicated posterior distributions. In response to this limitation, we introduce a novel variational importance sampling (VIS) approach that directly estimates and maximizes the log-likelihood. VIS leverages the optimal proposal distribution, achieved by minimizing the forward $\chi^2$ divergence, to enhance log-likelihood estimation. We apply VIS to various popular latent variable models, including mixture models, variational auto-encoders, and partially observable generalized linear models. Results demonstrate that our approach consistently outperforms state-of-the-art baselines, both in terms of log-likelihood and model parameter estimation.
    Steady-State Analysis of Queues with Hawkes Arrival and Its Application to Online Learning for Hawkes Queues. (arXiv:2311.02577v1 [math.PR])
    We investigate the long-run behavior of single-server queues with Hawkes arrivals and general service distributions and related optimization problems. In detail, utilizing novel coupling techniques, we establish finite moment bounds for the stationary distribution of the workload and busy period processes. In addition, we are able to show that, those queueing processes converge exponentially fast to their stationary distribution. Based on these theoretic results, we develop an efficient numerical algorithm to solve the optimal staffing problem for the Hawkes queues in a data-driven manner. Numerical results indicate a sharp difference in staffing for Hawkes queues, compared to the classic GI/GI/1 model, especially in the heavy-traffic regime.
    Approximating Langevin Monte Carlo with ResNet-like Neural Network architectures. (arXiv:2311.03242v1 [cs.LG])
    We sample from a given target distribution by constructing a neural network which maps samples from a simple reference, e.g. the standard normal distribution, to samples from the target. To that end, we propose using a neural network architecture inspired by the Langevin Monte Carlo (LMC) algorithm. Based on LMC perturbation results, we show approximation rates of the proposed architecture for smooth, log-concave target distributions measured in the Wasserstein-$2$ distance. The analysis heavily relies on the notion of sub-Gaussianity of the intermediate measures of the perturbed LMC process. In particular, we derive bounds on the growth of the intermediate variance proxies under different assumptions on the perturbations. Moreover, we propose an architecture similar to deep residual neural networks and derive expressivity results for approximating the sample to target distribution map.
    Data-Dependent Bounds for Online Portfolio Selection Without Lipschitzness and Smoothness. (arXiv:2305.13946v2 [cs.LG] UPDATED)
    This work introduces the first small-loss and gradual-variation regret bounds for online portfolio selection, marking the first instances of data-dependent bounds for online convex optimization with non-Lipschitz, non-smooth losses. The algorithms we propose exhibit sublinear regret rates in the worst cases and achieve logarithmic regrets when the data is "easy," with per-iteration time almost linear in the number of investment alternatives. The regret bounds are derived using novel smoothness characterizations of the logarithmic loss, a local norm-based analysis of following the regularized leader (FTRL) with self-concordant regularizers, which are not necessarily barriers, and an implicit variant of optimistic FTRL with the log-barrier.
    Barron Space for Graph Convolution Neural Networks. (arXiv:2311.02838v1 [stat.ML])
    Graph convolutional neural network (GCNN) operates on graph domain and it has achieved a superior performance to accomplish a wide range of tasks. In this paper, we introduce a Barron space of functions on a compact domain of graph signals. We prove that the proposed Barron space is a reproducing kernel Banach space, it can be decomposed into the union of a family of reproducing kernel Hilbert spaces with neuron kernels, and it could be dense in the space of continuous functions on the domain. Approximation property is one of the main principles to design neural networks. In this paper, we show that outputs of GCNNs are contained in the Barron space and functions in the Barron space can be well approximated by outputs of some GCNNs in the integrated square and uniform measurements. We also estimate the Rademacher complexity of functions with bounded Barron norm and conclude that functions in the Barron space could be learnt from their random samples efficiently.  ( 2 min )
    An $\varepsilon$-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond. (arXiv:2305.16041v2 [stat.ML] UPDATED)
    We propose EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits. It is the first instance of Top Two algorithm analyzed for approximate best arm identification. EB-TC$\varepsilon$ is an *anytime* sampling rule that can therefore be employed without modification for fixed confidence or fixed budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TC$\varepsilon$. First, we prove bounds on its expected sample complexity in the fixed confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any error parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TC$\varepsilon$ performs favorably compared to existing algorithms, in different settings.
    Low Tensor Rank Learning of Neural Dynamics. (arXiv:2308.11567v2 [q-bio.NC] UPDATED)
    Learning relies on coordinated synaptic changes in recurrently connected populations of neurons. Therefore, understanding the collective evolution of synaptic connectivity over learning is a key challenge in neuroscience and machine learning. In particular, recent work has shown that the weight matrices of task-trained RNNs are typically low rank, but how this low rank structure unfolds over learning is unknown. To address this, we investigate the rank of the 3-tensor formed by the weight matrices throughout learning. By fitting RNNs of varying rank to large-scale neural recordings during a motor learning task, we find that the inferred weights are low-tensor-rank and therefore evolve over a fixed low-dimensional subspace throughout the entire course of learning. We next validate the observation of low-tensor-rank learning on an RNN trained to solve the same task. Finally, we present a set of mathematical results bounding the matrix and tensor ranks of gradient descent learning dynamics which show that low-tensor-rank weights emerge naturally in RNNs trained to solve low-dimensional tasks. Taken together, our findings provide insight on the evolution of population connectivity over learning in both biological and artificial neural networks, and enable reverse engineering of learning-induced changes in recurrent dynamics from large-scale neural recordings.
    Neural Structure Learning with Stochastic Differential Equations. (arXiv:2311.03309v1 [cs.LG])
    Discovering the underlying relationships among variables from temporal observations has been a longstanding challenge in numerous scientific disciplines, including biology, finance, and climate science. The dynamics of such systems are often best described using continuous-time stochastic processes. Unfortunately, most existing structure learning approaches assume that the underlying process evolves in discrete-time and/or observations occur at regular time intervals. These mismatched assumptions can often lead to incorrect learned structures and models. In this work, we introduce a novel structure learning method, SCOTCH, which combines neural stochastic differential equations (SDE) with variational inference to infer a posterior distribution over possible structures. This continuous-time approach can naturally handle both learning from and predicting observations at arbitrary time points. Theoretically, we establish sufficient conditions for an SDE and SCOTCH to be structurally identifiable, and prove its consistency under infinite data limits. Empirically, we demonstrate that our approach leads to improved structure learning performance on both synthetic and real-world datasets compared to relevant baselines under regular and irregular sampling intervals.  ( 2 min )
    Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective. (arXiv:2305.15408v4 [cs.LG] UPDATED)
    Recent studies have discovered that Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs), particularly when dealing with complex tasks involving mathematics or reasoning. Despite the enormous empirical success, the underlying mechanisms behind CoT and how it unlocks the potential of LLMs remain elusive. In this paper, we take a first step towards theoretically answering these questions. Specifically, we examine the expressivity of LLMs with CoT in solving fundamental mathematical and decision-making problems. By using circuit complexity theory, we first give impossibility results showing that bounded-depth Transformers are unable to directly produce correct answers for basic arithmetic/equation tasks unless the model size grows super-polynomially with respect to the input length. In contrast, we then prove by construction that autoregressive Transformers of constant size suffice to solve both tasks by generating CoT derivations using a commonly used math language format. Moreover, we show LLMs with CoT can handle a general class of decision-making problems known as Dynamic Programming, thus justifying its power in tackling complex real-world tasks. Finally, an extensive set of experiments show that, while Transformers always fail to directly predict the answers, they can consistently learn to generate correct solutions step-by-step given sufficient CoT demonstrations.
    New Insights into Graph Convolutional Networks using Neural Tangent Kernels. (arXiv:2110.04060v2 [cs.LG] UPDATED)
    Graph Convolutional Networks (GCNs) have emerged as powerful tools for learning on network structured data. Although empirically successful, GCNs exhibit certain behaviour that has no rigorous explanation -- for instance, the performance of GCNs significantly degrades with increasing network depth, whereas it improves marginally with depth using skip connections. This paper focuses on semi-supervised learning on graphs, and explains the above observations through the lens of Neural Tangent Kernels (NTKs). We derive NTKs corresponding to infinitely wide GCNs (with and without skip connections). Subsequently, we use the derived NTKs to identify that, with suitable normalisation, network depth does not always drastically reduce the performance of GCNs -- a fact that we also validate through extensive simulation. Furthermore, we propose NTK as an efficient `surrogate model' for GCNs that does not suffer from performance fluctuations due to hyper-parameter tuning since it is a hyper-parameter free deterministic kernel. The efficacy of this idea is demonstrated through a comparison of different skip connections for GCNs using the surrogate NTKs.
    A unified recipe for deriving (time-uniform) PAC-Bayes bounds. (arXiv:2302.03421v4 [stat.ML] UPDATED)
    We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. Our main result is a PAC-Bayes theorem which holds for a wide class of discrete stochastic processes. We show how this result implies time-uniform versions of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds. Our framework also enables us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.
    Strong statistical parity through fair synthetic data. (arXiv:2311.03000v1 [cs.LG])
    AI-generated synthetic data, in addition to protecting the privacy of original data sets, allows users and data consumers to tailor data to their needs. This paper explores the creation of synthetic data that embodies Fairness by Design, focusing on the statistical parity fairness definition. By equalizing the learned target probability distributions of the synthetic data generator across sensitive attributes, a downstream model trained on such synthetic data provides fair predictions across all thresholds, that is, strong fair predictions even when inferring from biased, original data. This fairness adjustment can be either directly integrated into the sampling process of a synthetic generator or added as a post-processing step. The flexibility allows data consumers to create fair synthetic data and fine-tune the trade-off between accuracy and fairness without any previous assumptions on the data or re-training the synthetic data generator.
    A Theory for Emergence of Complex Skills in Language Models. (arXiv:2307.15936v2 [cs.LG] UPDATED)
    A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework. Contributions include: (a) A statistical framework that relates cross-entropy loss of LLMs to competence on the basic skills that underlie language tasks. (b) Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias that allows the pre-trained model to learn very efficiently. We informally call this {\em slingshot generalization} since naively viewed it appears to give competence levels at skills that violate usual generalization theory. (c) A key example of slingshot generalization, that competence at executing tasks involving $k$-tuples of skills emerges essentially at the same scaling and same rate as competence on the elementary skills themselves.
    Adaptive Linear Estimating Equations. (arXiv:2307.07320v2 [math.ST] UPDATED)
    Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering processes. Despite its advantages, such data collection mechanism often introduces complexities to the statistical inference procedure. For instance, the ordinary least squares (OLS) estimator in an adaptive linear regression model can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation. In this paper, we propose a general method for constructing debiased estimator which remedies this issue. It makes use of the idea of adaptive linear estimating equations, and we establish theoretical guarantees of asymptotic normality, supplemented by discussions on achieving near-optimal asymptotic variance. A salient feature of our estimator is that in the context of multi-armed bandits, our estimator retains the non-asymptotic performance of the least square estimator while obtaining asymptotic normality property. Consequently, this work helps connect two fruitful paradigms of adaptive inference: a) non-asymptotic inference using concentration inequalities and b) asymptotic inference via asymptotic normality.
    One-Shot Strategic Classification Under Unknown Costs. (arXiv:2311.02761v1 [cs.LG])
    A primary goal in strategic classification is to learn decision rules which are robust to strategic input manipulation. Earlier works assume that strategic responses are known; while some recent works address the important challenge of unknown responses, they exclusively study sequential settings which allow multiple model deployments over time. But there are many domains$\unicode{x2014}$particularly in public policy, a common motivating use-case$\unicode{x2014}$where multiple deployments are unrealistic, or where even a single bad round is undesirable. To address this gap, we initiate the study of strategic classification under unknown responses in the one-shot setting, which requires committing to a single classifier once. Focusing on the users' cost function as the source of uncertainty, we begin by proving that for a broad class of costs, even a small mis-estimation of the true cost can entail arbitrarily low accuracy in the worst case. In light of this, we frame the one-shot task as a minimax problem, with the goal of identifying the classifier with the smallest worst-case risk over an uncertainty set of possible costs. Our main contribution is efficient algorithms for both the full-batch and stochastic settings, which we prove converge (offline) to the minimax optimal solution at the dimension-independent rate of $\tilde{\mathcal{O}}(T^{-\frac{1}{2}})$. Our analysis reveals important structure stemming from the strategic nature of user responses, particularly the importance of dual norm regularization with respect to the cost function.  ( 2 min )
    Heteroskedastic Tensor Clustering. (arXiv:2311.02306v1 [math.ST])
    Tensor clustering, which seeks to extract underlying cluster structures from noisy tensor observations, has gained increasing attention. One extensively studied model for tensor clustering is the tensor block model, which postulates the existence of clustering structures along each mode and has found broad applications in areas like multi-tissue gene expression analysis and multilayer network analysis. However, currently available computationally feasible methods for tensor clustering either are limited to handling i.i.d. sub-Gaussian noise or suffer from suboptimal statistical performance, which restrains their utility in applications that have to deal with heteroskedastic data and/or low signal-to-noise-ratio (SNR). To overcome these challenges, we propose a two-stage method, named $\mathsf{High\text{-}order~HeteroClustering}$ ($\mathsf{HHC}$), which starts by performing tensor subspace estimation via a novel spectral algorithm called $\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$, followed by approximate $k$-means to obtain cluster nodes. Encouragingly, our algorithm provably achieves exact clustering as long as the SNR exceeds the computational limit (ignoring logarithmic factors); here, the SNR refers to the ratio of the pairwise disparity between nodes to the noise level, and the computational limit indicates the lowest SNR that enables exact clustering with polynomial runtime. Comprehensive simulation and real-data experiments suggest that our algorithm outperforms existing algorithms across various settings, delivering more reliable clustering performance.  ( 2 min )
    Validity problems in clinical machine learning by indirect data labeling using consensus definitions. (arXiv:2311.03037v1 [cs.LG])
    We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation. Machine learning models trained on this data will learn nothing else but to exactly reconstruct the known target definition. Such models show perfect performance on similarly constructed test data but will fail catastrophically on real-world examples where the defining fundamental measurements are not or only incompletely available. We present a general procedure allowing identification of problematic datasets and black-box machine learning models trained on them, and exemplify our detection procedure on the task of early prediction of sepsis.  ( 2 min )
    Weight-Sharing Regularization. (arXiv:2311.03096v1 [cs.LG])
    Weight-sharing is ubiquitous in deep learning. Motivated by this, we introduce ''weight-sharing regularization'' for neural networks, defined as $R(w) = \frac{1}{d - 1}\sum_{i > j}^d |w_i - w_j|$. We study the proximal mapping of $R$ and provide an intuitive interpretation of it in terms of a physical system of interacting particles. Using this interpretation, we design a novel parallel algorithm for $\operatorname{prox}_R$ which provides an exponential speedup over previous algorithms, with a depth of $O(\log^3 d)$. Our algorithm makes it feasible to train weight-sharing regularized deep neural networks with proximal gradient descent. Experiments reveal that weight-sharing regularization enables fully-connected networks to learn convolution-like filters.  ( 2 min )
    Practical considerations for variable screening in the Super Learner. (arXiv:2311.03313v1 [stat.ML])
    Estimating a prediction function is a fundamental component of many data analyses. The Super Learner ensemble, a particular implementation of stacking, has desirable theoretical properties and has been used successfully in many applications. Dimension reduction can be accomplished by using variable screening algorithms, including the lasso, within the ensemble prior to fitting other prediction algorithms. However, the performance of a Super Learner using the lasso for dimension reduction has not been fully explored in cases where the lasso is known to perform poorly. We provide empirical results that suggest that a diverse set of candidate screening algorithms should be used to protect against poor performance of any one screen, similar to the guidance for choosing a library of prediction algorithms for the Super Learner.  ( 2 min )
    Variational Weighting for Kernel Density Ratios. (arXiv:2311.03001v1 [cs.LG])
    Kernel density estimation (KDE) is integral to a range of generative and discriminative tasks in machine learning. Drawing upon tools from the multidimensional calculus of variations, we derive an optimal weight function that reduces bias in standard kernel density estimates for density ratios, leading to improved estimates of prediction posteriors and information-theoretic measures. In the process, we shed light on some fundamental aspects of density estimation, particularly from the perspective of algorithms that employ KDEs as their main building blocks.  ( 2 min )
    ELEGANT: Certified Defense on the Fairness of Graph Neural Networks. (arXiv:2311.02757v1 [cs.LG])
    Graph Neural Networks (GNNs) have emerged as a prominent graph learning model in various graph-based tasks over the years. Nevertheless, due to the vulnerabilities of GNNs, it has been empirically proved that malicious attackers could easily corrupt the fairness level of their predictions by adding perturbations to the input graph data. In this paper, we take crucial steps to study a novel problem of certifiable defense on the fairness level of GNNs. Specifically, we propose a principled framework named ELEGANT and present a detailed theoretical certification analysis for the fairness of GNNs. ELEGANT takes any GNNs as its backbone, and the fairness level of such a backbone is theoretically impossible to be corrupted under certain perturbation budgets for attackers. Notably, ELEGANT does not have any assumption over the GNN structure or parameters, and does not require re-training the GNNs to realize certification. Hence it can serve as a plug-and-play framework for any optimized GNNs ready to be deployed. We verify the satisfactory effectiveness of ELEGANT in practice through extensive experiments on real-world datasets across different backbones of GNNs, where ELEGANT is also demonstrated to be beneficial for GNN debiasing. Open-source code can be found at https://github.com/yushundong/ELEGANT.  ( 2 min )
    Regularized Linear Regression for Binary Classification. (arXiv:2311.02270v1 [cs.LG])
    Regularized linear regression is a promising approach for binary classification problems in which the training set has noisy labels since the regularization term can help to avoid interpolating the mislabeled data points. In this paper we provide a systematic study of the effects of the regularization strength on the performance of linear classifiers that are trained to solve binary classification problems by minimizing a regularized least-squares objective. We consider the over-parametrized regime and assume that the classes are generated from a Gaussian Mixture Model (GMM) where a fraction $c<\frac{1}{2}$ of the training data is mislabeled. Under these assumptions, we rigorously analyze the classification errors resulting from the application of ridge, $\ell_1$, and $\ell_\infty$ regression. In particular, we demonstrate that ridge regression invariably improves the classification error. We prove that $\ell_1$ regularization induces sparsity and observe that in many cases one can sparsify the solution by up to two orders of magnitude without any considerable loss of performance, even though the GMM has no underlying sparsity structure. For $\ell_\infty$ regularization we show that, for large enough regularization strength, the optimal weights concentrate around two values of opposite sign. We observe that in many cases the corresponding "compression" of each weight to a single bit leads to very little loss in performance. These latter observations can have significant practical ramifications.  ( 2 min )
    Riemannian Laplace Approximation with the Fisher Metric. (arXiv:2311.02766v1 [cs.LG])
    The Laplace's method approximates a target density with a Gaussian distribution at its mode. It is computationally efficient and asymptotically exact for Bayesian inference due to the Bernstein-von Mises theorem, but for complex targets and finite-data posteriors it is often too crude an approximation. A recent generalization of the Laplace Approximation transforms the Gaussian approximation according to a chosen Riemannian geometry providing a richer approximation family, while still retaining computational efficiency. However, as shown here, its properties heavily depend on the chosen metric, indeed the metric adopted in previous work results in approximations that are overly narrow as well as being biased even at the limit of infinite data. We correct this shortcoming by developing the approximation family further, deriving two alternative variants that are exact at the limit of infinite data, extending the theoretical analysis of the method, and demonstrating practical improvements in a range of experiments.  ( 2 min )
    From Coupled Oscillators to Graph Neural Networks: Reducing Over-smoothing via a Kuramoto Model-based Approach. (arXiv:2311.03260v1 [cs.LG])
    We propose the Kuramoto Graph Neural Network (KuramotoGNN), a novel class of continuous-depth graph neural networks (GNNs) that employs the Kuramoto model to mitigate the over-smoothing phenomenon, in which node features in GNNs become indistinguishable as the number of layers increases. The Kuramoto model captures the synchronization behavior of non-linear coupled oscillators. Under the view of coupled oscillators, we first show the connection between Kuramoto model and basic GNN and then over-smoothing phenomenon in GNNs can be interpreted as phase synchronization in Kuramoto model. The KuramotoGNN replaces this phase synchronization with frequency synchronization to prevent the node features from converging into each other while allowing the system to reach a stable synchronized state. We experimentally verify the advantages of the KuramotoGNN over the baseline GNNs and existing methods in reducing over-smoothing on various graph deep learning benchmark tasks.  ( 2 min )
    Spatial Process Approximations: Assessing Their Necessity. (arXiv:2311.03201v1 [stat.ML])
    In spatial statistics and machine learning, the kernel matrix plays a pivotal role in prediction, classification, and maximum likelihood estimation. A thorough examination reveals that for large sample sizes, the kernel matrix becomes ill-conditioned, provided the sampling locations are fairly evenly distributed. This condition poses significant challenges to numerical algorithms used in prediction and estimation computations and necessitates an approximation to prediction and the Gaussian likelihood. A review of current methodologies for managing large spatial data indicates that some fail to address this ill-conditioning problem. Such ill-conditioning often results in low-rank approximations of the stochastic processes. This paper introduces various optimality criteria and provides solutions for each.  ( 2 min )
    On Subagging Boosted Probit Model Trees. (arXiv:2311.02827v1 [stat.ML])
    With the insight of variance-bias decomposition, we design a new hybrid bagging-boosting algorithm named SBPMT for classification problems. For the boosting part of SBPMT, we propose a new tree model called Probit Model Tree (PMT) as base classifiers in AdaBoost procedure. For the bagging part, instead of subsampling from the dataset at each step of boosting, we perform boosted PMTs on each subagged dataset and combine them into a powerful "committee", which can be viewed an incomplete U-statistic. Our theoretical analysis shows that (1) SBPMT is consistent under certain assumptions, (2) Increase the subagging times can reduce the generalization error of SBPMT to some extent and (3) Large number of ProbitBoost iterations in PMT can benefit the performance of SBPMT with fewer steps in the AdaBoost part. Those three properties are verified by a famous simulation designed by Mease and Wyner (2008). The last two points also provide a useful guidance in model tuning. A comparison of performance with other state-of-the-art classification methods illustrates that the proposed SBPMT algorithm has competitive prediction power in general and performs significantly better in some cases.  ( 2 min )
    Nonparametric modeling of the composite effect of multiple nutrients on blood glucose dynamics. (arXiv:2311.03129v1 [stat.ML])
    In biomedical applications it is often necessary to estimate a physiological response to a treatment consisting of multiple components, and learn the separate effects of the components in addition to the joint effect. Here, we extend existing probabilistic nonparametric approaches to explicitly address this problem. We also develop a new convolution-based model for composite treatment-response curves that is more biologically interpretable. We validate our models by estimating the impact of carbohydrate and fat in meals on blood glucose. By differentiating treatment components, incorporating their dosages, and sharing statistical information across patients via a hierarchical multi-output Gaussian process, our method improves prediction accuracy over existing approaches, and allows us to interpret the different effects of carbohydrates and fat on the overall glucose response.  ( 2 min )
    Bayesian Optimization of Function Networks with Partial Evaluations. (arXiv:2311.02146v1 [stat.ML])
    Bayesian optimization is a framework for optimizing functions that are costly or time-consuming to evaluate. Recent work has considered Bayesian optimization of function networks (BOFN), where the objective function is computed via a network of functions, each taking as input the output of previous nodes in the network and additional parameters. Exploiting this network structure has been shown to yield significant performance improvements. Existing BOFN algorithms for general-purpose networks are required to evaluate the full network at each iteration. However, many real-world applications allow evaluating nodes individually. To take advantage of this opportunity, we propose a novel knowledge gradient acquisition function for BOFN that chooses which node to evaluate as well as the inputs for that node in a cost-aware fashion. This approach can dramatically reduce query costs by allowing the evaluation of part of the network at a lower cost relative to evaluating the entire network. We provide an efficient approach to optimizing our acquisition function and show it outperforms existing BOFN methods and other benchmarks across several synthetic and real-world problems. Our acquisition function is the first to enable cost-aware optimization of a broad class of function networks.  ( 2 min )
    Estimating treatment effects from single-arm trials via latent-variable modeling. (arXiv:2311.03002v1 [cs.LG])
    Randomized controlled trials (RCTs) are the accepted standard for treatment effect estimation but they can be infeasible due to ethical reasons and prohibitive costs. Single-arm trials, where all patients belong to the treatment group, can be a viable alternative but require access to an external control group. We propose an identifiable deep latent-variable model for this scenario that can also account for missing covariate observations by modeling their structured missingness patterns. Our method uses amortized variational inference to learn both group-specific and identifiable shared latent representations, which can subsequently be used for (i) patient matching if treatment outcomes are not available for the treatment group, or for (ii) direct treatment effect estimation assuming outcomes are available for both groups. We evaluate the model on a public benchmark as well as on a data set consisting of a published RCT study and real-world electronic health records. Compared to previous methods, our results show improved performance both for direct treatment effect estimation as well as for effect estimation via patient matching.  ( 2 min )
    Log-Concavity of Multinomial Likelihood Functions Under Interval Censoring Constraints on Frequencies or Their Partial Sums. (arXiv:2311.02763v1 [math.ST])
    We show that the likelihood function for a multinomial vector observed under arbitrary interval censoring constraints on the frequencies or their partial sums is completely log-concave by proving that the constrained sample spaces comprise M-convex subsets of the discrete simplex.  ( 2 min )
    Exploiting Correlated Auxiliary Feedback in Parameterized Bandits. (arXiv:2311.02715v1 [cs.LG])
    We study a novel variant of the parameterized bandits problem in which the learner can observe additional auxiliary feedback that is correlated with the observed reward. The auxiliary feedback is readily available in many real-life applications, e.g., an online platform that wants to recommend the best-rated services to its users can observe the user's rating of service (rewards) and collect additional information like service delivery time (auxiliary feedback). In this paper, we first develop a method that exploits auxiliary feedback to build a reward estimator with tight confidence bounds, leading to a smaller regret. We then characterize the regret reduction in terms of the correlation coefficient between reward and its auxiliary feedback. Experimental results in different settings also verify the performance gain achieved by our proposed method.  ( 2 min )
    p-Laplacian Transformer. (arXiv:2311.03235v1 [cs.LG])
    $p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data. Smaller values of $p$ promote sparsity and interpretability, while larger values encourage smoother solutions. In this paper, we first show that the self-attention mechanism obtains the minimal Laplacian regularization ($p=2$) and encourages the smoothness in the architecture. However, the smoothness is not suitable for the heterophilic structure of self-attention in transformers where attention weights between tokens that are in close proximity and non-close ones are assigned indistinguishably. From that insight, we then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT), which leverages $p$-Laplacian regularization framework to harness the heterophilic features within self-attention layers. In particular, low $p$ values will effectively assign higher attention weights to tokens that are in close proximity to the current token being processed. We empirically demonstrate the advantages of p-LaT over the baseline transformers on a wide range of benchmark datasets.  ( 2 min )
    Structured Neural Networks for Density Estimation and Causal Inference. (arXiv:2311.02221v1 [cs.LG])
    Injecting structure into neural networks enables learning functions that satisfy invariances with respect to subsets of inputs. For instance, when learning generative models using neural networks, it is advantageous to encode the conditional independence structure of observed variables, often in the form of Bayesian networks. We propose the Structured Neural Network (StrNN), which injects structure through masking pathways in a neural network. The masks are designed via a novel relationship we explore between neural network architectures and binary matrix factorization, to ensure that the desired independencies are respected. We devise and study practical algorithms for this otherwise NP-hard design problem based on novel objectives that control the model architecture. We demonstrate the utility of StrNN in three applications: (1) binary and Gaussian density estimation with StrNN, (2) real-valued density estimation with Structured Autoregressive Flows (StrAFs) and Structured Continuous Normalizing Flows (StrCNF), and (3) interventional and counterfactual analysis with StrAFs for causal inference. Our work opens up new avenues for learning neural networks that enable data-efficient generative modeling and the use of normalizing flows for causal effect estimation.  ( 2 min )
    Improved Convergence Rates of Anderson Acceleration for a Large Class of Fixed-Point Iterations. (arXiv:2311.02490v1 [math.NA])
    This paper studies Anderson acceleration (AA) for fixed-point methods ${x}^{(k+1)}=q({x}^{(k)})$. It provides the first proof that when the operator $q$ is linear and symmetric, AA improves the root-linear convergence factor over the fixed-point iterations. When $q$ is nonlinear, yet has a symmetric Jacobian at the solution, a slightly modified AA algorithm is proved to have an analogous root-linear convergence factor improvement over fixed-point iterations. Simulations verify our observations. Furthermore, experiments with different data models demonstrate AA is significantly superior to the standard fixed-point methods for Tyler's M-estimation.  ( 2 min )

  • Open

    [R] Are there any research papers which show why Wasserstein distance is better than Jensen-Shannon/KL_divergence/Bhattacharya distance for specific use cases ?
    I am trying to find reliable research work which show why displacement based metrics such as Wasserstein distance is a better suited metric than Jensen-Shannon distance in specific use cases and for certain set of distributions. Are there any well known works which delve into this ? submitted by /u/V1bicycle [link] [comments]  ( 9 min )
    [R] Animating NeRFs from Texture Space: A Framework for Pose-Dependent Rendering of Human Performances
    The key challenge is that NeRFs typically require multiple view images to reconstruct a scene in 3D, whereas videos provide only a single view over time. But that means we have to capture a lot of data to create a NeRF. What if there was a way to create 3D animated models of humans from monocular video footage using NeRFs? A new paper addresses this with a novel approach. First, they fit a parametric model (SMPL) to align with the subject in each frame of the video. This provides an initial estimate of the 3D shape. Second, they transform the coordinate system of the NeRF based on the surface of the SMPL model. This involves projecting input points onto the model's surface and calculating distances to the surface. Third, they incorporate the SMPL model's joint rotations to animate it in a variety of poses based on the video. This adds important pose-dependent shape cues. Finally, they use a neural network module to further refine the coordinate transform, correcting any inaccuracies in the SMPL fit to ensure spatial alignments are accurate for rendering. In experiments, they demonstrate their method generates high-quality renderings of subjects in novel views and poses not seen in the original video footage. The results capture nuanced clothing and hair deformations in a pose-dependent way. There are some example photos in the article that really show this off. Limitations exist for handling extremely complex motions and generating detailed face/hand geometry from low-resolution videos. But overall, the technique significantly advances the state-of-the-art in reconstructing animatable human models from monocular video. TLDR: They found a new NeRF technique to turn videos into controllable 3D models Full paper summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [P] Want to create an AI to play Super Mario for the DS
    As the title says i want to make an AI that will play New Super Mario Bros. for the DS. I want to be lead in the right direction for any resources that can help me achieve the completion of this project. I know a fair bit about Neural Networks but I am stumped on how to achieve this. I am using python and pytorch to be able to make this happen. Thank you for your responses submitted by /u/antonio_porven [link] [comments]  ( 9 min )
    [P] Model training bottlenecked by CPU.
    Recently, I've been working on some projects for fun, trying out some things I hadn't worked with before, such as profiling. But after profiling my code, I found out that my average GPU activity is around 50%. Apparently, the code frequently hangs for a few hundred milliseconds on the dataloader process. I've tried a few things in the dataloader: increasing/decreasing the number of workers, setting pin-memory to true or false, but neither seems to really matter. I have an NVME drive, so the disk is not the problem either. I've concluded that the bottleneck must be the CPU. Now, I've read that pre-processing the data might help, so that the dataloader doesn't have to decode the images, for example, but I don't really know how to go about this. I have around 2TB of NVME storage, and I've got a couple datasets on the disk (ImageNet and INaturalist are the two biggest ones), so I don't suppose I'll be able to store them on the disk uncompressed. Is there anything I can do to lighten the load on the CPU during training so that I can take advantage of the 50% of the GPU that I'm not using at the moment? submitted by /u/AdSignificant9235 [link] [comments]  ( 9 min )
    [Project] I made a learning environment/game using unity's mlagents and used DeepRL to train an AI agent to autonomously race a car around. (demo available)
    ​ DEMO AVAILABLE submitted by /u/Sookeyy [link] [comments]  ( 9 min )
    Is it possible and feasible to indentify roofing damage, age of roof, and roof type using satellite imagery and a CNN? [R]
    Working on a little side project. How would you do it? How would you start collecting data for this? I’m assuming the hardest part is going to be getting the data. submitted by /u/Longjumping-Name7564 [link] [comments]  ( 9 min )
    Is anyone familiar with methods of evaluating if a model was trained on some data? [R]
    I have a model and some articles. I'm trying to figure out if the model was trained on these articles before. Is anyone familiar with existing methods on how to do this? I know Microsoft had one method here (Appendix B) where they did the following: "for each content datapoint in a dataset, we partition the content in half. We pass in the first half as context and ask the model to generate up to the length of the second half. We then measure how similar the generated content is to the held-out second half using Levenshtein distance ratio, which is defined as one minus the ratio of Levenshtein distance to the maximum possible distance." Wondering if anyone is familiar with other methods submitted by /u/DaBobcat [link] [comments]  ( 9 min )
    [D] A Comprehensive Hand-Curated Resource List for Best OpenAI-GPTs
    Greetings, Excited to share with all those interested in GPTs released by OpenAI at Devday. We are a group of researchers who have carefully curated a comprehensive list on Github of the best GPT models, including descriptions, URLs, and other details. While our initial focus is on GPT models available through OpenAI, we will continuously maintain and update the list as new models are released. Resource list: https://github.com/promptslab/Awesome-Openai-GPTs We hope it will help you to get started & learn more about GPTs. Thank you :) https://preview.redd.it/1pjon2m6azyb1.png?width=1678&format=png&auto=webp&s=9999c13c20f429a8201bf0f1ce51489765c9838d ​ submitted by /u/Alternative-File-146 [link] [comments]  ( 9 min )
    [R] Suggestions for research topics in Neural Network pruning?
    Sorry for the vague question but I’m looking for relevant research topics related to NN pruning. Any suggestions are appreciated, thanks! submitted by /u/Sidekiiick02 [link] [comments]  ( 9 min )
    [discussion] Chatgpt and machine learning studies
    Hello as someone interested by ai and machine learning is it worth it to learn it when we have chatgpt and similar solutions that keeps getting bigger and bigger? Or is it better to go the classic software development route to buile the interfaces and apps for ml solutions? Thnks submitted by /u/Particular_Tea2307 [link] [comments]  ( 9 min )
    [D] Changing model architecture in transformers. Is it me or the library?
    I've got the following issue: Sometimes I want to adjust the architecture of huggingface model, e.g. for implementing the AnyMal Paper on Zephyr. It is not that difficult, just adding another input, concatenating with the text input. While I know how to do it in pure PyTorch, I struggle "hacking" the hugginface models. I end up wrapping them into nn.Module, which creates a mess. Transformers Trainer is not usable in that case, so I end up creating a training loop from scratch (which lacks all the comfort). Alternatively, I mess with the custom dataset and data collator. Is there any simple trick on how to modify the architecture without breaking the transformers models and loosing all the benefits and custom implementation details? submitted by /u/enricopallazo1 [link] [comments]  ( 9 min )
    [D] Master's degree thesis
    Hello everyone, I'm completing my Master's Degree in Artificial Intelligence in Italy and I have to choose the thesis subject. I'd like to deepen my knowledge about LLMs since they are the thing of the moment and we didn't study them during the course. My initial idea was to try to work on hallucinations by integrating external knowledge from Ontologies/Knowledge Bases, probably implementing some form of RAG. However in the last weeks a lot of papers were published on the subject and I don't know if it will still be any relevant when i graduate (in 6 months or so). Also I haven't validated the idea yet and don't know if it makes any sense. Do you have any suggestions on LLMs related topics that I can study for my thesis? submitted by /u/Lonely-Lingonberry-5 [link] [comments]  ( 9 min )
    [R] A Systematic Review of Deep Graph Neural Networks: Challenges, Classification, Architectures, Applications & Potential Utility in Bioinformatics
    Paper: https://arxiv.org/abs/2311.02127 Abstract: In recent years, tasks of machine learning ranging from image processing & audio/video analysis to natural language understanding have been transformed by deep learning. The data content in all these scenarios are expressed via Euclidean space. However, a considerable amount of application data is structured in non-Euclidean space and is expressed as graphs, e.g. dealing with complicated interactions & object interdependencies. Modelling physical systems, learning molecular signatures, identifying protein interactions and predicting diseases involve utilising a model that can adapt from graph data. Graph neural networks (GNNs), specified as artificial-neural models, employ message transmission between graph nodes to represent graph dependencies and are primarily used in the non-Euclidean domain. Variants of GNN like Graph Recurrent Networks (GRN), Graph Auto Encoder (GAE), Graph Convolution Networks (GCN), Graph Adversarial Methods & Graph Reinforcement learning have exhibited breakthrough productivity on a wide range of tasks, especially in the field of bioinformatics, in recent years as a result of the rapid collection of biological network data. Apart from presenting all existing GNN models, mathematical analysis and comparison of the variants of all types of GNN have been highlighted in this survey. Graph neural networks are investigated for their potential real-world applications in various fields, focusing on Bioinformatics. Furthermore, resources for evaluating graph neural network models and accessing open-source code & benchmark data sets are included. Ultimately, we provide some (seven) proposals for future research in this rapidly evolving domain. GNNs have the potential to be an excellent tool for solving a wide range of biological challenges in bioinformatics research, as they are best represented as connected complex graphs. ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [Discussion] Going from PoC to Production-ready AI data infra?
    Many of the best practices from ELT fully apply to the emerging AI tech stack, but there is an extra step missing - publishing transformed data to vector data stores. Airbyte suggests “ELTP”, or Extract, Load, Transform, and Publish, as an extensible architecture for delivering data to vector databases. How are folks syncing data between their sources and vector databases? ​ submitted by /u/Chemical-Treat6596 [link] [comments]  ( 9 min )
    [D] How to learn about distributed training
    Hello, I have been using a single GPU for my school but now there is a need to do distributed training. How should I learn about it? Eventually I want to use deepspeed, huggingface, etc. submitted by /u/rodeowrong [link] [comments]  ( 9 min )
    [P] Launching Arch (Feedback appreciated!): Simplifying Multi-tenant Data Integration for ML/AI applications
    Hello r/MachineLearning and developer community! We’re the team behind Meltano, an open source project for data movement and integration. Today we’re announcing our new company name and product: Arch! We’re building Arch so you can build AI-powered features on your customer’s data without having to worry about building bespoke data infrastructure just to access the data. A clear target for us in building our initial version was to aim to build a tool that could power LLM-based applications, for multi-tenant use cases, using any data, which would be loved by software engineers. What we got so far is: A great demo + two blog posts detailing what we’re in the process of building + a waitlist if you’re interested in catching up. Of course, you can also schedule a demo. Key Highlights of Arch, from what we’ve already built: Fresh Data from Anywhere: Leverage data from any source — APIs, databases, files — seamlessly integrated into Arch. Auto Vector-embeddings: Build AI features fast, with automatic generation and synchronization of vector embeddings for any data. Effortless Multi-Tenancy: Secure and easy management of third-party authentication including OAuth, supporting numerous customers with customizable per-tenant SQL transformations. Flexible Data Access: Access any data table through SQL, GraphQL, or REST directly from your application. Developer-Centric Design: Arch is built with developers in mind, featuring a declarative approach and integrated best practices for version control. If this is of any interest to you, we’d love to chat. Also very happy to answer any questions here. Blog post: https://www.arch.dev/blog/announcing-arch-the-data-backend-for-ai-products/ GitHub: https://github.com/archdotdev submitted by /u/tayloramurphy [link] [comments]
    Panel time series forecasting [p]
    I am stuck in this project for months. I have a panel time series, 2000 time series and each time series only 14 points length. And many covariates ( 20). I need to do prediction for a single target variable while taking into consideration all the covariates. If you have ideas, links, resources that can help me solving this problem, please let me know. Thanks a lot. submitted by /u/Beginner4ever [link] [comments]  ( 9 min )
    [D] Optimization Problem
    There's a trading system that has a few parameters (like TP(take profit), SL(stop loss), Time limit and a couple of other ones). As you might know, there's a certain limit for each of them. Let's say we want the TP to be between 1% and 3% (always). And this happens for all the parameters as well. I specifically want to know how I can find the most suitable set of parameters. For example if tp is 2%, sl is 8%, time limit is 2000 minutes, .... the result would be less riskier(less profit). Since the number of parameters is not low (there're 10 of them) and the range for each of them is not discrete, it's practically impossible for us to run the system for each set of parameters. Therefore we tried using an optimization algorithm to be able to find the desirable result. We ended up using Genetic algorithm. It wasn't bad but turned out was not the best option we could use. Since it takes a lot of time to run (initial population should be 1000 and for each single run it takes 20 minutes, so...) Just consider the fact that the results of close values for parameters are not similar, for example if tp = 1.5%, sl = 8%, time limit = 2000 and I only change tp to 1.8%, the result totally changes! and you can't predict how it will affect the system. So there is a need for an optimization algorithm that can find the numerical values of them. Any ideas are highly appreciated! submitted by /u/Felicity_222 [link] [comments]  ( 9 min )
    [D] How to Build Data Products | Evolve: Part 4/4 Advanced SLOs, Feedback Loops, Optimised Data Product, and more!
    In this article, authors have discussed the evolve stage of 4 part series on how to build data products. Relevant for all the modern age data folks! What awaits you in this succinct piece: >>Introduction: ‘Evolve’ at a Glance, Relevance >>Fundamentals —— Evolutionary Architecture —— Fitness Functions —— Fitness Parameters —— Dynamic Changes —— Feedback Loops >>Evolving Data Products with Self-Serve Data Platforms —— Metrics Monitoring, SLO Optimisation, Catalysts for Higher Adoption —— Progressive Inclusion of Multiple Use Cases —— Resource Optimisation —— Maitenance Automation —— RCA & Log Analysis Enhancements —— Other Capabilities Critical to Evolve Stage >>Fitness in the Context of Data Mesh >>Read the complete article here: https://moderndata101.substack.com/p/evolving-data-products What are your thoughts on the entire approach??? submitted by /u/growth_man [link] [comments]  ( 9 min )
    [D] Is a career in Machine Learning satisfying for Linguists?
    TItle says it all; I have a passion for linguistics and want to apply it outside the academe. As far as I can tell, the majority of machine learning is dealing with databases, but for anyone that's dealt with LLMs, how much of linguistics is actually applied? submitted by /u/Steak-Burrito [link] [comments]  ( 9 min )
    [Research] Looking for an incomplete dataset that should be messy or contain various data quality issues.
    Hello, Reddit community, I'm working on a project that focuses on query-oriented data cleaning with human expert involvement, and I'm in search of a suitable dataset to support this research. The dataset should ideally contain messy or incomplete data. If you know of any relevant datasets or sources where I can find such data, I would greatly appreciate your assistance. Additionally, if you have any suggestions or insights on where to look for datasets with data quality issues, please feel free to share them. Thank you in advance for your help and suggestions! submitted by /u/thelifeofZ080 [link] [comments]
    [P] Residual-free, purely-feedforward network trains to 94% on CIFAR10 in <6.3 seconds on a single A100
    Release link: https://github.com/tysam-code/hlb-CIFAR10/releases/tag/v0.7.0 Hi, I'm sure some of you have seen this project as it's progressed over the past year or so. Today we've hit a pretty big milestone by removing residual layers from the network without reducing the accuracy or convergence time under (what are pretty brutal) optimization conditions. The network trains more quickly as a result of fewer kernel launches, though I am sure the learnable information flow also has some significant impact as well. There is a technical discussion at https://twitter.com/hi_tysam/status/1721764010159477161 that goes over more details. If you'd like a TL;DR -- enforcing information flow in neural networks via external operators (like channel concatenation in DenseNets and channel addition i…  ( 10 min )
    [D] Reverse engineering GPT-vision from pricing
    So I have been looking at GPT4-V pricing trying to determine what kind of pipeline they use, feel free to chime in, dispute, etc. I do not have many conclusions but hoping that the crowd is wiser. ​ https://preview.redd.it/t4udi6lnntyb1.png?width=552&format=png&auto=webp&s=6f02a8c62cb9eb5b104ef36f50d2e8d0ee7a431c Observations: you are billed on the token count which can be calculated from the image resolution. This suggests they do not do append the OCR result to the GPT4 input. There are 85 base tokens irrespective of the image size. Maybe they run the whole image through some vision encoder and somehow get 85 tokens? 85 is a strange number, not close base of 2, no convenient squares, what does it have? why 85? Maybe to mess with us? The image is tiled with 512x512 tiles, each tile converts to 170 tokens. 170 = 13*13+1? Maybe they use some kind of OCR and 170 is the average number of tokens they expect? But that would mean that gpt4 should not be able to differentiate small things in an image (it would just have 85 global tokens). Knowing OpenAI it seems unlikely they would have the 2 stage pipeline. GPT4-V can accurately read text from image. My guess would be a strong vision encoder for the 85 tokens and some light encoder for the 512x512 tiles, where most of the processing happens inside gpt4. Retrofitting GPT4 with vision suggests they have a vision encoder which maps to GPT4 tokens. ​ What do you think? submitted by /u/President_Xi_ [link] [comments]
    [N] GPT-4 Turbo with 128K of context
    https://openai.com/blog/new-models-and-developer-products-announced-at-devday Excited for the RAG implementations this will support. Also, turbo! submitted by /u/gar1t [link] [comments]
    [Project] What process is used in these reinforcement learning models I see on Youtube?
    I have seen videos like this: https://www.youtube.com/watch?v=Dw3BZ6O_8LY and I'm curious as to how someone would do this. Some classmates and I want to recreate a board game and train a model on a digital recreation of it, but we're not sure where to start. Any help is useful! submitted by /u/Aboudi556 [link] [comments]
  • Open

    They found a new NeRF technique to turn videos into controllable 3D models
    The key challenge is that NeRFs typically require multiple view images to reconstruct a scene in 3D, whereas videos provide only a single view over time. But that means we have to capture a lot of data to create a NeRF. What if there was a way to create 3D animated models of humans from monocular video footage using NeRFs? A new paper addresses this with a novel approach. First, they fit a parametric model (SMPL) to align with the subject in each frame of the video. This provides an initial estimate of the 3D shape. Second, they transform the coordinate system of the NeRF based on the surface of the SMPL model. This involves projecting input points onto the model's surface and calculating distances to the surface. Third, they incorporate the SMPL model's joint rotations to animate it in a variety of poses based on the video. This adds important pose-dependent shape cues. Finally, they use a neural network module to further refine the coordinate transform, correcting any inaccuracies in the SMPL fit to ensure spatial alignments are accurate for rendering. In experiments, they demonstrate their method generates high-quality renderings of subjects in novel views and poses not seen in the original video footage. The results capture nuanced clothing and hair deformations in a pose-dependent way. There are some example photos in the article that really show this off. Limitations exist for handling extremely complex motions and generating detailed face/hand geometry from low-resolution videos. But overall, the technique significantly advances the state-of-the-art in reconstructing animatable human models from monocular video. TLDR: They found a new NeRF technique to turn videos into controllable 3D models Full paper summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]
    The Biggest Update for ChatGPT
    submitted by /u/Senior_tasteey [link] [comments]
    Report finds that people are comfortable using AI for menial tasks at work but they would much rather rely on or engage with a human for the more personal and subjective aspects of their job.
    submitted by /u/Sy3Zy3Gy3 [link] [comments]
    cosmictrip.space updated with DALL-E 3: space and sci-fi images generated by prompts written by GPT-4
    submitted by /u/cryptoz [link] [comments]
    Is there a way to AI cover 60+ minutes of voice at one time?
    I'm in a pickle.. I want to make a podcast.. I don't want my voice to be public on it (but a friend of mine is willing to let me use them as a model).. And I don't have to cut audio into little clips, run them through the software that covers the audio, and restitch it back into one big file.. Is there any way to overcome all those dilemmas or am I just gonna have to deal with a few? Just figured I'd ask if anyone had any experience with ai voice podcasts just in case. submitted by /u/spazzyvomit916 [link] [comments]
    Trump versus Human-Sized Crab
    submitted by /u/co8222 [link] [comments]
    The first AI nation? A ship with 10,000 Nvidia H100 GPUs worth $500 million could become the first ever sovereign territory that relies entirely on artificial intelligence for its future | TechRadar
    . submitted by /u/AminoOxi [link] [comments]
    From RNNs to GPT4 - 10 years of NLP research explained in 50 concepts
    submitted by /u/AvvYaa [link] [comments]
    "The Dark Reality of Artificial Intelligence: Navigating Ethical, Societal, and Existential Challenges"
    submitted by /u/Fit-Code-5141 [link] [comments]
    Need help for NSFW Ai CHATBOT
    I Know this is a stupid question but I tried services like crushonai and more but they all require premium memberships. Is there anyway to use them completely free or any alternatives as I am in uni but live off my parents. Please help me! submitted by /u/RLIIDarK [link] [comments]
    Best Ai for generating repeating patterns?
    I’ve been testing different stuff, Craiyon does an excellent job but it’s not “private”. Dalle 3 is good but not great, Midjourney is meh (and or I just haven’t become too acquainted with it yet/don’t know any tricks). Do y’all have anything that you recommend?? Pattern examples— Wallpaper, wrapping papers, indigenous blankets, scrapbooking textures, Mossy Oak/Real Tree camo, interior design tiles, flooring, carpet samples submitted by /u/Maelasae [link] [comments]
    GREAT ANSWER FROM THE DUDE
    submitted by /u/the_anonymizer [link] [comments]
    AI spreadsheet app?
    So I am searching for an app which helps me editing a product sheet. In contains 500+ products but i need to edit the title or well delete one word and optimize the product description. What would be good? submitted by /u/MannyRibera32 [link] [comments]
    Artificial Intelligence Clashes with Copyright
    Artificial intelligence (AI) is causing clashes with copyright laws as it absorbs protected creations without authorization. Artists have complained about the theft of their work for training AI to imitate creators. The EU is preparing regulations to address this issue, while the UN held its first conference on the impact of AI. The question arises: is AI robbing artists of their work? AI poses a threat to intellectual property rights, with artists experiencing theft of their work for AI training purposes. Stephen Fry and Scarlett Johansson are among the artists who have taken legal action against unauthorized use of their voice and image by AI applications. The EU is working on regulations to make AI technology fairer, including in the creative field. AI's impact on copyright raises concerns about the loss of credit and compensation for creators. AI-generated works also raise questions about plagiarism and the economic benefit for companies. Future EU regulations aim to register generative AI systems and ensure transparency in data usage. Source : https://english.elpais.com/culture/2023-11-06/artificial-intelligence-clashes-with-copyright-is-it-stealing-thousands-of-protected-creations.html submitted by /u/NuseAI [link] [comments]
    Grandma's house, my childhood
    submitted by /u/Sea_Permit5660 [link] [comments]
    Hobbi project - Face Occlusion Detector
    submitted by /u/Gloomy_Recognition_4 [link] [comments]
    OpenAI unveils personalized AI apps as it seeks to expand its ChatGPT consumer business
    submitted by /u/donutloop [link] [comments]
    One-Minute Daily AI News 11/6/2023
    Microsoft-backed OpenAI announces GPT-4 Turbo, its most powerful AI yet.[1] OpenAI offers to pay for ChatGPT customers’ copyright lawsuits.[2] Artificial intelligence start-up OpenAI laid out an ambitious vision for expanding its business selling directly to consumers, unveiling Monday an app-store-like marketplace where users will get paid for making chatbots on the company’s technology.[3] At OpenAI’s first developer conference, the company behind ChatGPT announced new tools that let anyone create a customized chatbot or AI agent—no coding skills required.[4] Sources: [1] https://www.cnbc.com/2023/11/06/openai-announces-more-powerful-gpt-4-turbo-and-cuts-prices.html [2] https://www.theguardian.com/technology/2023/nov/06/openai-chatgpt-customers-copyright-lawsuits [3] https://www.washingtonpost.com/technology/2023/11/06/openai-app-store-chat-gptstore/ [4] https://www.wired.com/story/openai-wants-everyone-to-build-their-own-version-of-chatgpt/ submitted by /u/Excellent-Target-847 [link] [comments]
    Whatsapp @meta AI was just released, aaand, apparently Zuck did not call anyone Dumb F__ks!
    ​ looks like the 'hand of god' was cooking the books here. https://preview.redd.it/p64tjmcyouyb1.png?width=543&format=png&auto=webp&s=627b5b7f05311a5fd0687fee20ff990b6c681a6c submitted by /u/jazz788 [link] [comments]
  • Open

    [Project] I recently completed this autonomous racing car project using DRL and mlagents toolkit.
    ​ You can checkout the demo or play the game here: https://sookeyy.itch.io/neuralnitro submitted by /u/Sookeyy [link] [comments]
    Model-based methods that don't learn Gaussians?
    I've come across a few model-based methods in continuous state spaces and the model is always a Gaussian. (In many cases, the environment itself is actually deterministic, but thats a story for another day.) Are there significant papers trying to make more powerful models work? Are there even problem settings where this is useful? I'd assume a decent starting point to model more complicated transitions is to use a noise-conditioned network, like in distributional RL. Maybe people use mixture of Gaussians, but I don't find that particularly satisfying. submitted by /u/_An_Other_Account_ [link] [comments]
    What is the difference between the RL environments with or without the terminated function?
    Hi! Recently, I read some examples of gymnasium environments and found that the terminated function is not defined in all environments. According to my understanding, the terminated function is used to terminate the current episode when the goal has been achieved. In some environments without the terminated function, will the episode continue to roll even though the goal has been achieved? Besides, some environments seem to omit the definition of the terminated function. Thus, I'm confused about the necessity of this function/property. Hope can get a detailed explanation, and that would be greatly appreciated! submitted by /u/UpperSearch4172 [link] [comments]
    After david silver's course
    Can i just dive into reading papers? Without berkeley's DRL course? submitted by /u/RealJuney [link] [comments]
  • Open

    Alternating updates for efficient transformers
    Posted by Xin Wang, Software Engineer, and Nishanth Dikkala, Research Scientist, Google Research Contemporary deep learning models have been remarkably successful in many domains, ranging from natural language to computer vision. Transformer neural networks (transformers) are a popular deep learning architecture that today comprise the foundation for most tasks in natural language processing and also are starting to extend to applications in other domains, such as computer vision, robotics, and autonomous driving. Moreover, they form the backbone of all the current state-of-the-art language models. Increasing scale in Transformer networks has led to improved performance and the emergence of behavior not present in smaller networks. However, this increase in scale often comes with pr…  ( 93 min )
  • Open

    DSC Weekly 7 November 2023
    Announcements Top Stories In-Depth The post DSC Weekly 7 November 2023 appeared first on Data Science Central.  ( 20 min )
    Artificial Intelligence: Making waves in the stock market
    Stock trading is an industry that has greatly benefited from artificial intelligence (AI). AI algorithms in stock trading have enabled traders to make better decisions and enhance their trading strategies leading to increased profits and reduced risks. AI algorithms utilize machine learning to analyze large quantities of financial data including historical stock prices, company financial… Read More »Artificial Intelligence: Making waves in the stock market The post Artificial Intelligence: Making waves in the stock market appeared first on Data Science Central.  ( 22 min )
    Regulatory Capture: Why AI regulation favours the incumbents
    We are seeing a flurry of regulation  But we should ask ourselves if we are seeing regulatory capture — ie letting corporations write lax rules that lead to public harm. Andrew Ng points out some contradictions: “It’s also a mistake to set reporting requirements based on a computation threshold for model training. This will stifle… Read More »Regulatory Capture: Why AI regulation favours the incumbents The post Regulatory Capture: Why AI regulation favours the incumbents appeared first on Data Science Central.  ( 20 min )
    Andrew Ng for President?
    We are seeing a bunch of AI regulatory announcements from both sides of the Atlantic  President Biden with his executive order on safe, secure and trustworthy AI, and Prime Minister Rishi Sunak on his AI safety summit (which is a closed door event). Both these initiatives are dominated by a specific view i.e. safety security and risk Against… Read More »Andrew Ng for President? The post Andrew Ng for President? appeared first on Data Science Central.  ( 20 min )
    Edge AI’s next phase
    A podcast with Fabrizio del Maffeo of Axelera AI As CEO and Co-founder of Axelera AI, Fabrizio del Maffeo is focused on harnessing the potential of the fifth generation of RISC (reduced instruction set computing) platforms in edge AI applications. Axelera announced early availability of its in-memory Metis AI platform in September 2023. Edge AI… Read More »Edge AI’s next phase The post Edge AI’s next phase appeared first on Data Science Central.  ( 20 min )
  • Open

    Harnessing the power of enterprise data with generative AI: Insights from Amazon Kendra, LangChain, and large language models
    Large language models (LLMs) with their broad knowledge, can generate human-like text on almost any topic. However, their training on massive datasets also limits their usefulness for specialized tasks. Without continued learning, these models remain oblivious to new data and trends that emerge after their initial training. Furthermore, the cost to train new LLMs can […]  ( 14 min )
  • Open

    Toward developing faster algorithms for minimizing submodular functions
    This research paper was presented at the 64th IEEE Symposium on Foundations of Computer Science (FOCS) 2023 (opens in new tab), a premier forum for the latest research in theoretical computer science. Submodular functions are versatile mathematical tools, finding diverse applications in real-world scenarios and guiding solutions across complex domains. From dissecting the intricate networks […] The post Toward developing faster algorithms for minimizing submodular functions appeared first on Microsoft Research.  ( 10 min )
  • Open

    Digital Artist Steven Tung Shows Off So-fish-ticated Style This Week ‘In the NVIDIA Studio’
    Taiwanese artist Steven Tung creates captivating 2D and 3D digital art that explores sci-fi, minimalism and realism and pushes artistic boundaries.  ( 6 min )
    Digital Artist Steven Tung Shows Off So-fish-ticated Style This Week ‘In the NVIDIA Studio’
    Taiwanese artist Steven Tung creates captivating 2D and 3D digital art that explores sci-fi, minimalism and realism and pushes artistic boundaries.  ( 6 min )
  • Open

    Scalable Transformer for PDE Surrogate Modeling. (arXiv:2305.17560v2 [cs.LG] UPDATED)
    Transformer has shown state-of-the-art performance on various applications and has recently emerged as a promising tool for surrogate modeling of partial differential equations (PDEs). Despite the introduction of linear-complexity attention, applying Transformer to problems with a large number of grid points can be numerically unstable and computationally expensive. In this work, we propose Factorized Transformer (FactFormer), which is based on an axial factorized kernel integral. Concretely, we introduce a learnable projection operator that decomposes the input function into multiple sub-functions with one-dimensional domain. These sub-functions are then evaluated and used to compute the instance-based kernel with an axial factorized scheme. We showcase that the proposed model is able to simulate 2D Kolmogorov flow on a $256\times 256$ grid and 3D smoke buoyancy on a $64\times64\times64$ grid with good accuracy and efficiency. The proposed factorized scheme can serve as a computationally efficient low-rank surrogate for the full attention scheme when dealing with multi-dimensional problems.  ( 2 min )
    Graph Neural Networks with polynomial activations have limited expressivity. (arXiv:2310.13139v2 [cs.LG] UPDATED)
    The expressivity of Graph Neural Networks (GNNs) can be entirely characterized by appropriate fragments of the first-order logic. Namely, any query of the two variable fragment of graded modal logic (GC2) interpreted over labeled graphs can be expressed using a GNN whose size depends only on the depth of the query. As pointed out by [Barcelo & Al., 2020, Grohe, 2021], this description holds for a family of activation functions, leaving the possibility for a hierarchy of logics expressible by GNNs depending on the chosen activation function. In this article, we show that such hierarchy indeed exists by proving that GC2 queries cannot be expressed by GNNs with polynomial activation functions. This implies a separation between polynomial and popular non-polynomial activations (such as ReLUs, sigmoid and hyperbolic tan and others) and answers an open question formulated by [Grohe, 2021].  ( 2 min )
    Separable PINN: Mitigating the Curse of Dimensionality in Physics-Informed Neural Networks. (arXiv:2211.08761v3 [cs.LG] UPDATED)
    Physics-informed neural networks (PINNs) have emerged as new data-driven PDE solvers for both forward and inverse problems. While promising, the expensive computational costs to obtain solutions often restrict their broader applicability. We demonstrate that the computations in automatic differentiation (AD) can be significantly reduced by leveraging forward-mode AD when training PINN. However, a naive application of forward-mode AD to conventional PINNs results in higher computation, losing its practical benefit. Therefore, we propose a network architecture, called separable PINN (SPINN), which can facilitate forward-mode AD for more efficient computation. SPINN operates on a per-axis basis instead of point-wise processing in conventional PINNs, decreasing the number of network forward passes. Besides, while the computation and memory costs of standard PINNs grow exponentially along with the grid resolution, that of our model is remarkably less susceptible, mitigating the curse of dimensionality. We demonstrate the effectiveness of our model in various PDE systems by significantly reducing the training run-time while achieving comparable accuracy. Project page: https://jwcho5576.github.io/spinn/  ( 2 min )
    Prompt Engineering Through the Lens of Optimal Control. (arXiv:2310.14201v2 [cs.LG] UPDATED)
    Prompt Engineering (PE) has emerged as a critical technique for guiding Large Language Models (LLMs) in solving intricate tasks. Its importance is highlighted by its potential to significantly enhance the efficiency and effectiveness of human-machine interaction. As tasks grow increasingly complex, recent advanced PE methods have extended beyond the limitations of single-round interactions to embrace multi-round interactions, which allows for a deeper and more nuanced engagement with LLMs. In this paper, we propose an optimal control framework tailored for multi-round interactions with LLMs. This framework provides a unified mathematical structure that not only systematizes the existing PE methods but also sets the stage for rigorous analytical improvements. Furthermore, we extend this framework to include PE via ensemble methods and multi-agent collaboration, thereby enlarging the scope of applicability. By adopting an optimal control perspective, we offer fresh insights into existing PE methods and highlight theoretical challenges that warrant future research. Besides, our work lays a foundation for the development of more effective and interpretable PE methods.  ( 2 min )
    Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference. (arXiv:2305.13484v3 [cs.DC] UPDATED)
    Autoregressive models, despite their commendable performance in a myriad of generative tasks, face challenges stemming from their inherently sequential structure. Inference on these models, by design, harnesses a temporal dependency, where the current token's probability distribution is conditioned on preceding tokens. This inherent characteristic severely impedes computational efficiency during inference as a typical inference request can require more than thousands of tokens, where generating each token requires a load of entire model weights, making the inference more memory-bound. The large overhead becomes profound in real deployment where requests arrive randomly, necessitating various generation lengths. Existing solutions, such as dynamic batching and concurrent instances, introduce significant response delays and bandwidth contention, falling short of achieving optimal latency and throughput. To address these shortcomings, we propose Flover -- a temporal fusion framework for efficiently inferring multiple requests in parallel. We deconstruct the general generation pipeline into pre-processing and token generation, and equip the framework with a dedicated work scheduler for fusing the generation process temporally across all requests. By orchestrating the token-level parallelism, Flover exhibits optimal hardware efficiency and significantly spares the system resources. By further employing a fast buffer reordering algorithm that allows memory eviction of finished tasks, it brings over 11x inference speedup on GPT and 16x on LLAMA compared to the cutting-edge solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging the advanced tensor parallel technique, Flover proves efficacious across diverse computational landscapes, from single-GPU setups to distributed scenarios, thereby offering robust performance optimization that adapts to variable use cases.  ( 3 min )
    Bayes beats Cross Validation: Efficient and Accurate Ridge Regression via Expectation Maximization. (arXiv:2310.18860v2 [stat.ML] UPDATED)
    We present a novel method for tuning the regularization hyper-parameter, $\lambda$, of a ridge regression that is faster to compute than leave-one-out cross-validation (LOOCV) while yielding estimates of the regression parameters of equal, or particularly in the setting of sparse covariates, superior quality to those obtained by minimising the LOOCV risk. The LOOCV risk can suffer from multiple and bad local minima for finite $n$ and thus requires the specification of a set of candidate $\lambda$, which can fail to provide good solutions. In contrast, we show that the proposed method is guaranteed to find a unique optimal solution for large enough $n$, under relatively mild conditions, without requiring the specification of any difficult to determine hyper-parameters. This is based on a Bayesian formulation of ridge regression that we prove to have a unimodal posterior for large enough $n$, allowing for both the optimal $\lambda$ and the regression coefficients to be jointly learned within an iterative expectation maximization (EM) procedure. Importantly, we show that by utilizing an appropriate preprocessing step, a single iteration of the main EM loop can be implemented in $O(\min(n, p))$ operations, for input data with $n$ rows and $p$ columns. In contrast, evaluating a single value of $\lambda$ using fast LOOCV costs $O(n \min(n, p))$ operations when using the same preprocessing. This advantage amounts to an asymptotic improvement of a factor of $l$ for $l$ candidate values for $\lambda$ (in the regime $q, p \in O(\sqrt{n})$ where $q$ is the number of regression targets).  ( 3 min )
    Modeling Dynamics over Meshes with Gauge Equivariant Nonlinear Message Passing. (arXiv:2310.19589v2 [cs.LG] UPDATED)
    Data over non-Euclidean manifolds, often discretized as surface meshes, naturally arise in computer graphics and biological and physical systems. In particular, solutions to partial differential equations (PDEs) over manifolds depend critically on the underlying geometry. While graph neural networks have been successfully applied to PDEs, they do not incorporate surface geometry and do not consider local gauge symmetries of the manifold. Alternatively, recent works on gauge equivariant convolutional and attentional architectures on meshes leverage the underlying geometry but underperform in modeling surface PDEs with complex nonlinear dynamics. To address these issues, we introduce a new gauge equivariant architecture using nonlinear message passing. Our novel architecture achieves higher performance than either convolutional or attentional networks on domains with highly complex and nonlinear dynamics. However, similar to the non-mesh case, design trade-offs favor convolutional, attentional, or message passing networks for different tasks; we investigate in which circumstances our message passing method provides the most benefit.  ( 2 min )
    Finite-Time Logarithmic Bayes Regret Upper Bounds. (arXiv:2306.09136v2 [cs.LG] UPDATED)
    We derive the first finite-time logarithmic Bayes regret upper bounds for Bayesian bandits. In Gaussian bandits, we obtain $O(c_\Delta \log n)$ and $O(c_h \log^2 n)$ bounds for an upper confidence bound algorithm, where $c_h$ and $c_\Delta$ are constants depending on the prior distribution and the gaps of random bandit instances sampled from it, respectively. The latter bound asymptotically matches the lower bound of Lai (1987). Our proofs are a major technical departure from prior works, while being simple and general. To show the generality of our techniques, we apply them to linear bandits. Our results provide insights on the value of prior in the Bayesian setting, both in the objective and as a side information given to the learner. They significantly improve upon existing $\tilde{O}(\sqrt{n})$ bounds, which have become standard in the literature despite the existing lower bounds.  ( 2 min )
    Recognition of Unseen Bird Species by Learning from Field Guides. (arXiv:2206.01466v2 [cs.CV] UPDATED)
    We exploit field guides to learn bird species recognition, in particular zero-shot recognition of unseen species. Illustrations contained in field guides deliberately focus on discriminative properties of each species, and can serve as side information to transfer knowledge from seen to unseen bird species. We study two approaches: (1) a contrastive encoding of illustrations, which can be fed into standard zero-shot learning schemes; and (2) a novel method that leverages the fact that illustrations are also images and as such structurally more similar to photographs than other kinds of side information. Our results show that illustrations from field guides, which are readily available for a wide range of species, are indeed a competitive source of side information for zero-shot learning. On a subset of the iNaturalist2021 dataset with 749 seen and 739 unseen species, we obtain a classification accuracy of unseen bird species of $12\%$ @top-1 and $38\%$ @top-10, which shows the potential of field guides for challenging real-world scenarios with many species. Our code is available at https://github.com/ac-rodriguez/zsl_billow  ( 2 min )
    General Purpose Artificial Intelligence Systems (GPAIS): Properties, Definition, Taxonomy, Societal Implications and Responsible Governance. (arXiv:2307.14283v2 [cs.AI] UPDATED)
    Most applications of Artificial Intelligence (AI) are designed for a confined and specific task. However, there are many scenarios that call for a more general AI, capable of solving a wide array of tasks without being specifically designed for them. The term General-Purpose Artificial Intelligence Systems (GPAIS) has been defined to refer to these AI systems. To date, the possibility of an Artificial General Intelligence, powerful enough to perform any intellectual task as if it were human, or even improve it, has remained an aspiration, fiction, and considered a risk for our society. Whilst we might still be far from achieving that, GPAIS is a reality and sitting at the forefront of AI research. This work discusses existing definitions for GPAIS and proposes a new definition that allows for a gradual differentiation among types of GPAIS according to their properties and limitations. We distinguish between closed-world and open-world GPAIS, characterising their degree of autonomy and ability based on several factors such as adaptation to new tasks, competence in domains not intentionally trained for, ability to learn from few data, or proactive acknowledgment of their own limitations. We propose a taxonomy of approaches to realise GPAIS, describing research trends such as the use of AI techniques to improve another AI (AI-powered AI) or (single) foundation models. As a prime example, we delve into GenAI, aligning them with the concepts presented in the taxonomy. We explore multi-modality, which involves fusing various types of data sources to expand the capabilities of GPAIS. Through the proposed definition and taxonomy, our aim is to facilitate research collaboration across different areas that are tackling general purpose tasks, as they share many common aspects. Finally, we discuss the state of GPAIS, prospects, societal implications, and the need for regulation and governance.  ( 3 min )
    Numerical influence of ReLU'(0) on backpropagation. (arXiv:2106.12915v4 [cs.LG] UPDATED)
    In theory, the choice of ReLU(0) in [0, 1] for a neural network has a negligible influence both on backpropagation and training. Yet, in the real world, 32 bits default precision combined with the size of deep learning problems makes it a hyperparameter of training methods. We investigate the importance of the value of ReLU'(0) for several precision levels (16, 32, 64 bits), on various networks (fully connected, VGG, ResNet) and datasets (MNIST, CIFAR10, SVHN, ImageNet). We observe considerable variations of backpropagation outputs which occur around half of the time in 32 bits precision. The effect disappears with double precision, while it is systematic at 16 bits. For vanilla SGD training, the choice ReLU'(0) = 0 seems to be the most efficient. For our experiments on ImageNet the gain in test accuracy over ReLU'(0) = 1 was more than 10 points (two runs). We also evidence that reconditioning approaches as batch-norm or ADAM tend to buffer the influence of ReLU'(0)'s value. Overall, the message we convey is that algorithmic differentiation of nonsmooth problems potentially hides parameters that could be tuned advantageously.  ( 2 min )
    Learning COVID-19 Regional Transmission Using Universal Differential Equations in a SIR model. (arXiv:2310.16804v2 [cs.LG] UPDATED)
    Highly-interconnected societies difficult to model the spread of infectious diseases such as COVID-19. Single-region SIR models fail to account for incoming forces of infection and expanding them to a large number of interacting regions involves many assumptions that do not hold in the real world. We propose using Universal Differential Equations (UDEs) to capture the influence of neighboring regions and improve the model's predictions in a combined SIR+UDE model. UDEs are differential equations totally or partially defined by a deep neural network (DNN). We include an additive term to the SIR equations composed by a DNN that learns the incoming force of infection from the other regions. The learning is performed using automatic differentiation and gradient descent to approach the change in the target system caused by the state of the neighboring regions. We compared the proposed model using a simulated COVID-19 outbreak against a single-region SIR and a fully data-driven model composed only of a DNN. The proposed UDE+SIR model generates predictions that capture the outbreak dynamic more accurately, but a decay in performance is observed at the last stages of the outbreak. The single-area SIR and the fully data-driven approach do not capture the proper dynamics accurately. Once the predictions were obtained, we employed the SINDy algorithm to substitute the DNN with a regression, removing the black box element of the model with no considerable increase in the error levels.  ( 3 min )
    Learning Extrinsic Dexterity with Parameterized Manipulation Primitives. (arXiv:2310.17785v2 [cs.RO] UPDATED)
    Many practically relevant robot grasping problems feature a target object for which all grasps are occluded, e.g., by the environment. Single-shot grasp planning invariably fails in such scenarios. Instead, it is necessary to first manipulate the object into a configuration that affords a grasp. We solve this problem by learning a sequence of actions that utilize the environment to change the object's pose. Concretely, we employ hierarchical reinforcement learning to combine a sequence of learned parameterized manipulation primitives. By learning the low-level manipulation policies, our approach can control the object's state through exploiting interactions between the object, the gripper, and the environment. Designing such a complex behavior analytically would be infeasible under uncontrolled conditions, as an analytic approach requires accurate physical modeling of the interaction and contact dynamics. In contrast, we learn a hierarchical policy model that operates directly on depth perception data, without the need for object detection, pose estimation, or manual design of controllers. We evaluate our approach on picking box-shaped objects of various weight, shape, and friction properties from a constrained table-top workspace. Our method transfers to a real robot and is able to successfully complete the object picking task in 98\% of experimental trials.  ( 2 min )
    On minimizers and convolutional filters: theoretical connections and applications to genome analysis. (arXiv:2111.08452v5 [cs.LG] UPDATED)
    Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. Here, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step towards building a deep learning assembler, though it is at present too slow to be practical. In total, this manuscript provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.  ( 3 min )
    Detecting Pretraining Data from Large Language Models. (arXiv:2310.16789v2 [cs.CL] UPDATED)
    Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to three real-world scenarios, copyrighted book detection, contaminated downstream example detection and privacy auditing of machine unlearning, and find it a consistently effective solution.  ( 3 min )
    The language of prompting: What linguistic properties make a prompt successful?. (arXiv:2311.01967v1 [cs.CL])
    The latest generation of LLMs can be prompted to achieve impressive zero-shot or few-shot performance in many NLP tasks. However, since performance is highly sensitive to the choice of prompts, considerable effort has been devoted to crowd-sourcing prompts or designing methods for prompt optimisation. Yet, we still lack a systematic understanding of how linguistic properties of prompts correlate with task performance. In this work, we investigate how LLMs of different sizes, pre-trained and instruction-tuned, perform on prompts that are semantically equivalent, but vary in linguistic structure. We investigate both grammatical properties such as mood, tense, aspect and modality, as well as lexico-semantic variation through the use of synonyms. Our findings contradict the common assumption that LLMs achieve optimal performance on lower perplexity prompts that reflect language use in pretraining or instruction-tuning data. Prompts transfer poorly between datasets or models, and performance cannot generally be explained by perplexity, word frequency, ambiguity or prompt length. Based on our results, we put forward a proposal for a more robust and comprehensive evaluation standard for prompting research.  ( 2 min )
    Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review. (arXiv:2311.01918v1 [cs.CL])
    With the rapid development of artificial intelligence, large language models (LLMs) have shown promising capabilities in mimicking human-level language comprehension and reasoning. This has sparked significant interest in applying LLMs to enhance various aspects of healthcare, ranging from medical education to clinical decision support. However, medicine involves multifaceted data modalities and nuanced reasoning skills, presenting challenges for integrating LLMs. This paper provides a comprehensive review on the applications and implications of LLMs in medicine. It begins by examining the fundamental applications of general-purpose and specialized LLMs, demonstrating their utilities in knowledge retrieval, research support, clinical workflow automation, and diagnostic assistance. Recognizing the inherent multimodality of medicine, the review then focuses on multimodal LLMs, investigating their ability to process diverse data types like medical imaging and EHRs to augment diagnostic accuracy. To address LLMs' limitations regarding personalization and complex clinical reasoning, the paper explores the emerging development of LLM-powered autonomous agents for healthcare. Furthermore, it summarizes the evaluation methodologies for assessing LLMs' reliability and safety in medical contexts. Overall, this review offers an extensive analysis on the transformative potential of LLMs in modern medicine. It also highlights the pivotal need for continuous optimizations and ethical oversight before these models can be effectively integrated into clinical practice. Visit https://github.com/mingze-yuan/Awesome-LLM-Healthcare for an accompanying GitHub repository containing latest papers.  ( 3 min )
    Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization. (arXiv:2311.01900v1 [stat.ML])
    Quantifying the difference between two probability density functions, $p$ and $q$, using available data, is a fundamental problem in Statistics and Machine Learning. A usual approach for addressing this problem is the likelihood-ratio estimation (LRE) between $p$ and $q$, which -- to our best knowledge -- has been investigated mainly for the offline case. This paper contributes by introducing a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x'_t \sim q)$ are observed over time. The non-parametric nature of our approach has the advantage of being agnostic to the forms of $p$ and $q$. Moreover, we capitalize on the recent advances in Kernel Methods and functional minimization to develop an estimator that can be efficiently updated online. We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.  ( 2 min )
    Advancing Bayesian Optimization via Learning Correlated Latent Space. (arXiv:2310.20258v2 [cs.LG] UPDATED)
    Bayesian optimization is a powerful method for optimizing black-box functions with limited function evaluations. Recent works have shown that optimization in a latent space through deep generative models such as variational autoencoders leads to effective and efficient Bayesian optimization for structured or discrete data. However, as the optimization does not take place in the input space, it leads to an inherent gap that results in potentially suboptimal solutions. To alleviate the discrepancy, we propose Correlated latent space Bayesian Optimization (CoBO), which focuses on learning correlated latent spaces characterized by a strong correlation between the distances in the latent space and the distances within the objective function. Specifically, our method introduces Lipschitz regularization, loss weighting, and trust region recoordination to minimize the inherent gap around the promising areas. We demonstrate the effectiveness of our approach on several optimization tasks in discrete data, such as molecule design and arithmetic expression fitting, and achieve high performance within a small budget.
    Learning UI-to-Code Reverse Generator Using Visual Critic Without Rendering. (arXiv:2305.14637v2 [cs.CV] UPDATED)
    Automated reverse engineering of HTML/CSS code from UI screenshots is an important yet challenging problem with broad applications in website development and design. In this paper, we propose a novel vision-code transformer (ViCT) composed of a vision encoder processing the screenshots and a language decoder to generate the code. They are initialized by pre-trained models such as ViT/DiT and GPT-2/LLaMA but aligning the two modalities requires end-to-end finetuning, which aims to minimize the visual discrepancy between the code-rendered webpage and the original screenshot. However, the rendering is non-differentiable and causes costly overhead. We address this problem by actor-critic fine-tuning where a visual critic without rendering (ViCR) is developed to predict visual discrepancy given the original and generated code. To train and evaluate our models, we created two synthetic datasets of varying complexity, with over 75,000 unique (code, screenshot) pairs. We evaluate the UI-to-Code performance using a combination of automated metrics such as MSE, BLEU, IoU, and a novel htmlBLEU score. ViCT outperforms a strong baseline model DiT-GPT2, improving IoU from 0.64 to 0.79 and lowering MSE from 12.25 to 9.02. With much lower computational cost, it can achieve comparable performance as when using a larger decoder such as LLaMA.  ( 2 min )
    Dropout Strategy in Reinforcement Learning: Limiting the Surrogate Objective Variance in Policy Optimization Methods. (arXiv:2310.20380v3 [cs.LG] UPDATED)
    Policy-based reinforcement learning algorithms are widely used in various fields. Among them, mainstream policy optimization algorithms such as TRPO and PPO introduce importance sampling into policy iteration, which allows the reuse of historical data. However, this can also lead to a high variance of the surrogate objective and indirectly affects the stability and convergence of the algorithm. In this paper, we first derived an upper bound of the surrogate objective variance, which can grow quadratically with the increase of the surrogate objective. Next, we proposed the dropout technique to avoid the excessive increase of the surrogate objective variance caused by importance sampling. Then, we introduced a general reinforcement learning framework applicable to mainstream policy optimization methods, and applied the dropout technique to the PPO algorithm to obtain the D-PPO variant. Finally, we conduct comparative experiments between D-PPO and PPO algorithms in the Atari 2600 environment, and the results show that D-PPO achieved significant performance improvements compared to PPO, and effectively limited the excessive increase of the surrogate objective variance during training.
    CapsFusion: Rethinking Image-Text Data at Scale. (arXiv:2310.20550v2 [cs.CV] UPDATED)
    Large multimodal models demonstrate remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute fundamentally to this success, but suffer from excessive noise. Recent studies use alternative captions synthesized by captioning models and have achieved notable benchmark performance. However, our experiments reveal significant Scalability Deficiency and World Knowledge Loss issues in models trained with synthetic captions, which have been largely obscured by their initial benchmark success. Upon closer examination, we identify the root cause as the overly-simplified language structure and lack of knowledge details in existing synthetic captions. To provide higher-quality and more scalable multimodal pretraining data, we propose CapsFusion, an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions. Extensive experiments show that CapsFusion captions exhibit remarkable all-round superiority over existing captions in terms of model performance (e.g., 18.8 and 18.3 improvements in CIDEr score on COCO and NoCaps), sample efficiency (requiring 11-16 times less computation than baselines), world knowledge depth, and scalability. These effectiveness, efficiency and scalability advantages position CapsFusion as a promising candidate for future scaling of LMM training.
    Ensemble models outperform single model uncertainties and predictions for operator-learning of hypersonic flows. (arXiv:2311.00060v2 [physics.flu-dyn] UPDATED)
    High-fidelity computational simulations and physical experiments of hypersonic flows are resource intensive. Training scientific machine learning (SciML) models on limited high-fidelity data offers one approach to rapidly predict behaviors for situations that have not been seen before. However, high-fidelity data is itself in limited quantity to validate all outputs of the SciML model in unexplored input space. As such, an uncertainty-aware SciML model is desired. The SciML model's output uncertainties could then be used to assess the reliability and confidence of the model's predictions. In this study, we extend a DeepONet using three different uncertainty quantification mechanisms: mean-variance estimation, evidential uncertainty, and ensembling. The uncertainty aware DeepONet models are trained and evaluated on the hypersonic flow around a blunt cone object with data generated via computational fluid dynamics over a wide range of Mach numbers and altitudes. We find that ensembling outperforms the other two uncertainty models in terms of minimizing error and calibrating uncertainty in both interpolative and extrapolative regimes.
    Domain Randomization via Entropy Maximization. (arXiv:2311.01885v1 [cs.LG])
    Varying dynamics parameters in simulation is a popular Domain Randomization (DR) approach for overcoming the reality gap in Reinforcement Learning (RL). Nevertheless, DR heavily hinges on the choice of the sampling distribution of the dynamics parameters, since high variability is crucial to regularize the agent's behavior but notoriously leads to overly conservative policies when randomizing excessively. In this paper, we propose a novel approach to address sim-to-real transfer, which automatically shapes dynamics distributions during training in simulation without requiring real-world data. We introduce DOmain RAndomization via Entropy MaximizatiON (DORAEMON), a constrained optimization problem that directly maximizes the entropy of the training distribution while retaining generalization capabilities. In achieving this, DORAEMON gradually increases the diversity of sampled dynamics parameters as long as the probability of success of the current policy is sufficiently high. We empirically validate the consistent benefits of DORAEMON in obtaining highly adaptive and generalizable policies, i.e. solving the task at hand across the widest range of dynamics parameters, as opposed to representative baselines from the DR literature. Notably, we also demonstrate the Sim2Real applicability of DORAEMON through its successful zero-shot transfer in a robotic manipulation setup under unknown real-world parameters.  ( 2 min )
    Managing AI Risks in an Era of Rapid Progress. (arXiv:2310.17688v1 [cs.CY] CROSS LISTED)
    In this short consensus paper, we outline risks from upcoming, advanced AI systems. We examine large-scale social harms and malicious uses, as well as an irreversible loss of human control over autonomous AI systems. In light of rapid and continuing AI progress, we propose priorities for AI R&D and governance.
    When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment. (arXiv:2307.03864v4 [cs.LG] UPDATED)
    Reinforcement learning (RL) algorithms face two distinct challenges: learning effective representations of past and present observations, and determining how actions influence future returns. Both challenges involve modeling long-term dependencies. The Transformer architecture has been very successful to solve problems that involve long-term dependencies, including in the RL domain. However, the underlying reason for the strong performance of Transformer-based RL methods remains unclear: is it because they learn effective memory, or because they perform effective credit assignment? After introducing formal definitions of memory length and credit assignment length, we design simple configurable tasks to measure these distinct quantities. Our empirical results reveal that Transformers can enhance the memory capability of RL algorithms, scaling up to tasks that require memorizing observations $1500$ steps ago. However, Transformers do not improve long-term credit assignment. In summary, our results provide an explanation for the success of Transformers in RL, while also highlighting an important area for future research and benchmark design. Our code is open-sourced at https://github.com/twni2016/Memory-RL  ( 2 min )
    LLQL: Logistic Likelihood Q-Learning for Reinforcement Learning. (arXiv:2307.02345v3 [cs.LG] UPDATED)
    Modern reinforcement learning (RL) can be categorized into online and offline variants. As a pivotal aspect of both online and offline RL, current research on the Bellman equation revolves primarily around optimization techniques and performance enhancement rather than exploring the inherent structural properties of the Bellman error, such as its distribution characteristics. This study investigates the distribution of the Bellman approximation error in both online and offline settings through iterative exploration of the Bellman equation. We observed that both in online RL and offline RL, the Bellman error conforms to a Logistic distribution. Building upon this discovery, this study employed the Logistics maximum likelihood function (LLoss) as an alternative to the commonly used MSE Loss, assuming that Bellman errors adhere to a normal distribution. We validated our hypotheses through extensive numerical experiments across diverse online and offline environments. In particular, we applied corrections to the loss function across various baseline algorithms and consistently observed that the loss function with Logistic corrections outperformed the MSE counterpart significantly. Additionally, we conducted Kolmogorov-Smirnov tests to confirm the reliability of the Logistic distribution. This study's theoretical and empirical insights provide valuable groundwork for future investigations and enhancements centered on the distribution of Bellman errors.  ( 2 min )
    Improving Intrinsic Exploration by Creating Stationary Objectives. (arXiv:2310.18144v2 [cs.LG] UPDATED)
    Exploration bonuses in reinforcement learning guide long-horizon exploration by defining custom intrinsic objectives. Count-based methods use the frequency of state visits to derive an exploration bonus. In this paper, we identify that any intrinsic reward function derived from count-based methods is non-stationary and hence induces a difficult objective to optimize for the agent. The key contribution of our work lies in transforming the original non-stationary rewards into stationary rewards through an augmented state representation. For this purpose, we introduce the Stationary Objectives For Exploration (SOFE) framework. SOFE requires identifying sufficient statistics for different exploration bonuses and finding an efficient encoding of these statistics to use as input to a deep network. SOFE is based on proposing state augmentations that expand the state space but hold the promise of simplifying the optimization of the agent's objective. Our experiments show that SOFE improves the agents' performance in challenging exploration problems, including sparse-reward tasks, pixel-based observations, 3D navigation, and procedurally generated environments.  ( 2 min )
    CDGraph: Dual Conditional Social Graph Synthesizing via Diffusion Model. (arXiv:2311.01729v1 [cs.SI])
    The social graphs synthesized by the generative models are increasingly in demand due to data scarcity and concerns over user privacy. One of the key performance criteria for generating social networks is the fidelity to specified conditionals, such as users with certain membership and financial status. While recent diffusion models have shown remarkable performance in generating images, their effectiveness in synthesizing graphs has not yet been explored in the context of conditional social graphs. In this paper, we propose the first kind of conditional diffusion model for social networks, CDGraph, which trains and synthesizes graphs based on two specified conditions. We propose the co-evolution dependency in the denoising process of CDGraph to capture the mutual dependencies between the dual conditions and further incorporate social homophily and social contagion to preserve the connectivity between nodes while satisfying the specified conditions. Moreover, we introduce a novel classifier loss, which guides the training of the diffusion process through the mutual dependency of dual conditions. We evaluate CDGraph against four existing graph generative methods, i.e., SPECTRE, GSM, EDGE, and DiGress, on four datasets. Our results show that the generated graphs from CDGraph achieve much higher dual-conditional validity and lower discrepancy in various social network metrics than the baselines, thus demonstrating its proficiency in generating dual-conditional social graphs.  ( 2 min )
    Guiding Language Models of Code with Global Context using Monitors. (arXiv:2306.10763v2 [cs.CL] UPDATED)
    Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating. Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model. We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen .  ( 3 min )
    Provably Convergent Data-Driven Convex-Nonconvex Regularization. (arXiv:2310.05812v2 [cs.LG] UPDATED)
    An emerging new paradigm for solving inverse problems is via the use of deep learning to learn a regularizer from data. This leads to high-quality results, but often at the cost of provable guarantees. In this work, we show how well-posedness and convergent regularization arises within the convex-nonconvex (CNC) framework for inverse problems. We introduce a novel input weakly convex neural network (IWCNN) construction to adapt the method of learned adversarial regularization to the CNC framework. Empirically we show that our method overcomes numerical issues of previous adversarial methods.  ( 2 min )
    General Anomaly Detection of Underwater Gliders Validated by Large-scale Deployment Datasets. (arXiv:2308.00180v3 [cs.RO] UPDATED)
    Underwater gliders have been widely used in oceanography for a range of applications. However, unpredictable events like shark strikes or remora attachments can lead to abnormal glider behavior or even loss of the instrument. This paper employs an anomaly detection algorithm to assess operational conditions of underwater gliders in the real-world ocean environment. Prompt alerts are provided to glider pilots upon detecting any anomaly, so that they can take control of the glider to prevent further harm. The detection algorithm is applied to multiple datasets collected in real glider deployments led by the University of Georgia's Skidaway Institute of Oceanography (SkIO) and the University of South Florida (USF). In order to demonstrate the algorithm generality, the experimental evaluation is applied to four glider deployment datasets, each highlighting various anomalies happening in different scenes. Specifically, we utilize high resolution datasets only available post-recovery to perform detailed analysis of the anomaly and compare it with pilot logs. Additionally, we simulate the online detection based on the real-time subsets of data transmitted from the glider at the surfacing events. While the real-time data may not contain as much rich information as the post-recovery one, the online detection is of great importance as it allows glider pilots to monitor potential abnormal conditions in real time.  ( 3 min )
    Learning nonparametric latent causal graphs with unknown interventions. (arXiv:2306.02899v2 [stat.ML] UPDATED)
    We establish conditions under which latent causal graphs are nonparametrically identifiable and can be reconstructed from unknown interventions in the latent space. Our primary focus is the identification of the latent structure in measurement models without parametric assumptions such as linearity or Gaussianity. Moreover, we do not assume the number of hidden variables is known, and we show that at most one unknown intervention per hidden variable is needed. This extends a recent line of work on learning causal representations from observations and interventions. The proofs are constructive and introduce two new graphical concepts -- imaginary subsets and isolated edges -- that may be useful in their own right. As a matter of independent interest, the proofs also involve a novel characterization of the limits of edge orientations within the equivalence class of DAGs induced by unknown interventions. These are the first results to characterize the conditions under which causal representations are identifiable without making any parametric assumptions in a general setting with unknown interventions and without faithfulness.  ( 2 min )
    DeliverAI: Reinforcement Learning Based Distributed Path-Sharing Network for Food Deliveries. (arXiv:2311.02017v1 [cs.LG])
    Delivery of items from the producer to the consumer has experienced significant growth over the past decade and has been greatly fueled by the recent pandemic. Amazon Fresh, Shopify, UberEats, InstaCart, and DoorDash are rapidly growing and are sharing the same business model of consumer items or food delivery. Existing food delivery methods are sub-optimal because each delivery is individually optimized to go directly from the producer to the consumer via the shortest time path. We observe a significant scope for reducing the costs associated with completing deliveries under the current model. We model our food delivery problem as a multi-objective optimization, where consumer satisfaction and delivery costs, both, need to be optimized. Taking inspiration from the success of ride-sharing in the taxi industry, we propose DeliverAI - a reinforcement learning-based path-sharing algorithm. Unlike previous attempts for path-sharing, DeliverAI can provide real-time, time-efficient decision-making using a Reinforcement learning-enabled agent system. Our novel agent interaction scheme leverages path-sharing among deliveries to reduce the total distance traveled while keeping the delivery completion time under check. We generate and test our methodology vigorously on a simulation setup using real data from the city of Chicago. Our results show that DeliverAI can reduce the delivery fleet size by 12\%, the distance traveled by 13%, and achieve 50% higher fleet utilization compared to the baselines.
    Anytime-Competitive Reinforcement Learning with Policy Prior. (arXiv:2311.01568v1 [cs.LG])
    This paper studies the problem of Anytime-Competitive Markov Decision Process (A-CMDP). Existing works on Constrained Markov Decision Processes (CMDPs) aim to optimize the expected reward while constraining the expected cost over random dynamics, but the cost in a specific episode can still be unsatisfactorily high. In contrast, the goal of A-CMDP is to optimize the expected reward while guaranteeing a bounded cost in each round of any episode against a policy prior. We propose a new algorithm, called Anytime-Competitive Reinforcement Learning (ACRL), which provably guarantees the anytime cost constraints. The regret analysis shows the policy asymptotically matches the optimal reward achievable under the anytime competitive constraints. Experiments on the application of carbon-intelligent computing verify the reward performance and cost constraint guarantee of ACRL.
    A large-scale and PCR-referenced vocal audio dataset for COVID-19. (arXiv:2212.07738v4 [cs.SD] UPDATED)
    The UK COVID-19 Vocal Audio Dataset is designed for the training and evaluation of machine learning models that classify SARS-CoV-2 infection status or associated respiratory symptoms using vocal audio. The UK Health Security Agency recruited voluntary participants through the national Test and Trace programme and the REACT-1 survey in England from March 2021 to March 2022, during dominant transmission of the Alpha and Delta SARS-CoV-2 variants and some Omicron variant sublineages. Audio recordings of volitional coughs, exhalations, and speech were collected in the 'Speak up to help beat coronavirus' digital survey alongside demographic, self-reported symptom and respiratory condition data, and linked to SARS-CoV-2 test results. The UK COVID-19 Vocal Audio Dataset represents the largest collection of SARS-CoV-2 PCR-referenced audio recordings to date. PCR results were linked to 70,794 of 72,999 participants and 24,155 of 25,776 positive cases. Respiratory symptoms were reported by 45.62% of participants. This dataset has additional potential uses for bioacoustics research, with 11.30% participants reporting asthma, and 27.20% with linked influenza PCR test results.
    Improving Fairness using Vision-Language Driven Image Augmentation. (arXiv:2311.01573v1 [cs.CV])
    Fairness is crucial when training a deep-learning discriminative model, especially in the facial domain. Models tend to correlate specific characteristics (such as age and skin color) with unrelated attributes (downstream tasks), resulting in biases which do not correspond to reality. It is common knowledge that these correlations are present in the data and are then transferred to the models during training. This paper proposes a method to mitigate these correlations to improve fairness. To do so, we learn interpretable and meaningful paths lying in the semantic space of a pre-trained diffusion model (DiffAE) -- such paths being supervised by contrastive text dipoles. That is, we learn to edit protected characteristics (age and skin color). These paths are then applied to augment images to improve the fairness of a given dataset. We test the proposed method on CelebA-HQ and UTKFace on several downstream tasks with age and skin color as protected characteristics. As a proxy for fairness, we compute the difference in accuracy with respect to the protected characteristics. Quantitative results show how the augmented images help the model improve the overall accuracy, the aforementioned metric, and the disparity of equal opportunity. Code is available at: https://github.com/Moreno98/Vision-Language-Bias-Control.
    Convex and Non-convex Optimization Under Generalized Smoothness. (arXiv:2306.01264v2 [math.OC] UPDATED)
    Classical analysis of convex and non-convex optimization methods often requires the Lipshitzness of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.
    ChatGPT for GTFS: Benchmarking LLMs on GTFS Understanding and Retrieval. (arXiv:2308.02618v2 [cs.IR] UPDATED)
    The General Transit Feed Specification (GTFS) standard for publishing transit data is ubiquitous. GTFS being tabular data, with information spread across different files, necessitates specialized tools or packages to retrieve information. Concurrently, the use of Large Language Models(LLMs) for text and information retrieval is growing. The idea of this research is to see if the current widely adopted LLMs (ChatGPT) are able to understand GTFS and retrieve information from GTFS using natural language instructions without explicitly providing information. In this research, we benchmark OpenAI's GPT-3.5-Turbo and GPT-4 LLMs which are the backbone of ChatGPT. ChatGPT demonstrates a reasonable understanding of GTFS by answering 59.7% (GPT-3.5-Turbo) and 73.3% (GPT-4) of our multiple-choice questions (MCQ) correctly. Furthermore, we evaluated the LLMs on information extraction tasks using a filtered GTFS feed containing four routes. We found that program synthesis techniques outperformed zero-shot approaches, achieving up to 93% (90%) accuracy for simple queries and 61% (41%) for complex ones using GPT-4 (GPT-3.5-Turbo).
    Detecting and Mitigating Mode-Collapse for Flow-based Sampling of Lattice Field Theories. (arXiv:2302.14082v2 [hep-lat] UPDATED)
    We study the consequences of mode-collapse of normalizing flows in the context of lattice field theory. Normalizing flows allow for independent sampling. For this reason, it is hoped that they can avoid the tunneling problem of local-update MCMC algorithms for multi-modal distributions. In this work, we first point out that the tunneling problem is also present for normalizing flows but is shifted from the sampling to the training phase of the algorithm. Specifically, normalizing flows often suffer from mode-collapse for which the training process assigns vanishingly low probability mass to relevant modes of the physical distribution. This may result in a significant bias when the flow is used as a sampler in a Markov-Chain or with Importance Sampling. We propose a metric to quantify the degree of mode-collapse and derive a bound on the resulting bias. Furthermore, we propose various mitigation strategies in particular in the context of estimating thermodynamic observables, such as the free energy.
    Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization. (arXiv:2307.02108v3 [cs.LG] UPDATED)
    In many applications, e.g. in healthcare and e-commerce, the goal of a contextual bandit may be to learn an optimal treatment assignment policy at the end of the experiment. That is, to minimize simple regret. However, this objective remains understudied. We propose a new family of computationally efficient bandit algorithms for the stochastic contextual bandit setting, where a tuning parameter determines the weight placed on cumulative regret minimization (where we establish near-optimal minimax guarantees) versus simple regret minimization (where we establish state-of-the-art guarantees). Our algorithms work with any function class, are robust to model misspecification, and can be used in continuous arm settings. This flexibility comes from constructing and relying on "conformal arm sets" (CASs). CASs provide a set of arms for every context, encompassing the context-specific optimal arm with a certain probability across the context distribution. Our positive results on simple and cumulative regret guarantees are contrasted with a negative result, which shows that no algorithm can achieve instance-dependent simple regret guarantees while simultaneously achieving minimax optimal cumulative regret guarantees.
    Latent Diffusion Model for Conditional Reservoir Facies Generation. (arXiv:2311.01968v1 [physics.geo-ph])
    Creating accurate and geologically realistic reservoir facies based on limited measurements is crucial for field development and reservoir management, especially in the oil and gas sector. Traditional two-point geostatistics, while foundational, often struggle to capture complex geological patterns. Multi-point statistics offers more flexibility, but comes with its own challenges. With the rise of Generative Adversarial Networks (GANs) and their success in various fields, there has been a shift towards using them for facies generation. However, recent advances in the computer vision domain have shown the superiority of diffusion models over GANs. Motivated by this, a novel Latent Diffusion Model is proposed, which is specifically designed for conditional generation of reservoir facies. The proposed model produces high-fidelity facies realizations that rigorously preserve conditioning data. It significantly outperforms a GAN-based alternative.
    A Unified Approach for Maximizing Continuous DR-submodular Functions. (arXiv:2305.16671v2 [cs.LG] UPDATED)
    This paper presents a unified approach for maximizing continuous DR-submodular functions that encompasses a range of settings and oracle access types. Our approach includes a Frank-Wolfe type offline algorithm for both monotone and non-monotone functions, with different restrictions on the general convex set. We consider settings where the oracle provides access to either the gradient of the function or only the function value, and where the oracle access is either deterministic or stochastic. We determine the number of required oracle accesses in all cases. Our approach gives new/improved results for nine out of the sixteen considered cases, avoids computationally expensive projections in two cases, with the proposed framework matching performance of state-of-the-art approaches in the remaining five cases. Notably, our approach for the stochastic function value-based oracle enables the first regret bounds with bandit feedback for stochastic DR-submodular functions.
    Multi-Task Learning to Enhance Generalizability of Neural Network Equalizers in Coherent Optical Systems. (arXiv:2307.05374v3 [eess.SP] UPDATED)
    For the first time, multi-task learning is proposed to improve the flexibility of NN-based equalizers in coherent systems. A "single" NN-based equalizer improves Q-factor by up to 4 dB compared to CDC, without re-training, even with variations in launch power, symbol rate, or transmission distance.
    Transport, Variational Inference and Diffusions: with Applications to Annealed Flows and Schr\"odinger Bridges. (arXiv:2307.01050v3 [stat.ML] UPDATED)
    This paper explores the connections between optimal transport and variational inference, with a focus on forward and reverse time stochastic differential equations and Girsanov transformations.We present a principled and systematic framework for sampling and generative modelling centred around divergences on path space. Our work culminates in the development of a novel score-based annealed flow technique (with connections to Jarzynski and Crooks identities from statistical physics) and a regularised iterative proportional fitting (IPF)-type objective, departing from the sequential nature of standard IPF. Through a series of generative modelling examples and a double-well-based rare event task, we showcase the potential of the proposed methods.
    Long Sequence Hopfield Memory. (arXiv:2306.04532v2 [cs.NE] CROSS LISTED)
    Sequence memory is an essential attribute of natural and artificial intelligence that enables agents to encode, store, and retrieve complex sequences of stimuli and actions. Computational models of sequence memory have been proposed where recurrent Hopfield-like neural networks are trained with temporally asymmetric Hebbian rules. However, these networks suffer from limited sequence capacity (maximal length of the stored sequence) due to interference between the memories. Inspired by recent work on Dense Associative Memories, we expand the sequence capacity of these models by introducing a nonlinear interaction term, enhancing separation between the patterns. We derive novel scaling laws for sequence capacity with respect to network size, significantly outperforming existing scaling laws for models based on traditional Hopfield networks, and verify these theoretical results with numerical simulation. Moreover, we introduce a generalized pseudoinverse rule to recall sequences of highly correlated patterns. Finally, we extend this model to store sequences with variable timing between states' transitions and describe a biologically-plausible implementation, with connections to motor neuroscience.
    Differentially Private Topological Data Analysis. (arXiv:2305.03609v2 [stat.ML] UPDATED)
    This paper is the first to attempt differentially private (DP) topological data analysis (TDA), producing near-optimal private persistence diagrams. We analyze the sensitivity of persistence diagrams in terms of the bottleneck distance, and we show that the commonly used \v{C}ech complex has sensitivity that does not decrease as the sample size $n$ increases. This makes it challenging for the persistence diagrams of \v{C}ech complexes to be privatized. As an alternative, we show that the persistence diagram obtained by the $L^1$-distance to measure (DTM) has sensitivity $O(1/n)$. Based on the sensitivity analysis, we propose using the exponential mechanism whose utility function is defined in terms of the bottleneck distance of the $L^1$-DTM persistence diagrams. We also derive upper and lower bounds of the accuracy of our privacy mechanism; the obtained bounds indicate that the privacy error of our mechanism is near-optimal. We demonstrate the performance of our privatized persistence diagrams through simulations as well as on a real dataset tracking human movement.
    Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders. (arXiv:2310.08571v2 [cs.LG] UPDATED)
    Machine Learning as a Service (MLaaS) APIs provide ready-to-use and high-utility encoders that generate vector representations for given inputs. Since these encoders are very costly to train, they become lucrative targets for model stealing attacks during which an adversary leverages query access to the API to replicate the encoder locally at a fraction of the original training costs. We propose Bucks for Buckets (B4B), the first active defense that prevents stealing while the attack is happening without degrading representation quality for legitimate API users. Our defense relies on the observation that the representations returned to adversaries who try to steal the encoder's functionality cover a significantly larger fraction of the embedding space than representations of legitimate users who utilize the encoder to solve a particular downstream task.vB4B leverages this to adaptively adjust the utility of the returned representations according to a user's coverage of the embedding space. To prevent adaptive adversaries from eluding our defense by simply creating multiple user accounts (sybils), B4B also individually transforms each user's representations. This prevents the adversary from directly aggregating representations over multiple accounts to create their stolen encoder copy. Our active defense opens a new path towards securely sharing and democratizing encoders over public APIs.
    Detection of keratoconus Diseases using deep Learning. (arXiv:2311.01996v1 [eess.IV])
    One of the most serious corneal disorders, keratoconus is difficult to diagnose in its early stages and can result in blindness. This illness, which often appears in the second decade of life, affects people of all sexes and races. Convolutional neural networks (CNNs), one of the deep learning approaches, have recently come to light as particularly promising tools for the accurate and timely diagnosis of keratoconus. The purpose of this study was to evaluate how well different D-CNN models identified keratoconus-related diseases. To be more precise, we compared five different CNN-based deep learning architectures (DenseNet201, InceptionV3, MobileNetV2, VGG19, Xception). In our comprehensive experimental analysis, the DenseNet201-based model performed very well in keratoconus disease identification in our extensive experimental research. This model outperformed its D-CNN equivalents, with an astounding accuracy rate of 89.14% in three crucial classes: Keratoconus, Normal, and Suspect. The results demonstrate not only the stability and robustness of the model but also its practical usefulness in real-world applications for accurate and dependable keratoconus identification. In addition, D-CNN DenseNet201 performs extraordinarily well in terms of precision, recall rates, and F1 scores in addition to accuracy. These measures validate the model's usefulness as an effective diagnostic tool by highlighting its capacity to reliably detect instances of keratoconus and to reduce false positives and negatives.
    Reproducible Parameter Inference Using Bagged Posteriors. (arXiv:2311.02019v1 [stat.ME])
    Under model misspecification, it is known that Bayesian posteriors often do not properly quantify uncertainty about true or pseudo-true parameters. Even more fundamentally, misspecification leads to a lack of reproducibility in the sense that the same model will yield contradictory posteriors on independent data sets from the true distribution. To define a criterion for reproducible uncertainty quantification under misspecification, we consider the probability that two confidence sets constructed from independent data sets have nonempty overlap, and we establish a lower bound on this overlap probability that holds for any valid confidence sets. We prove that credible sets from the standard posterior can strongly violate this bound, particularly in high-dimensional settings (i.e., with dimension increasing with sample size), indicating that it is not internally coherent under misspecification. To improve reproducibility in an easy-to-use and widely applicable way, we propose to apply bagging to the Bayesian posterior ("BayesBag"'); that is, to use the average of posterior distributions conditioned on bootstrapped datasets. We motivate BayesBag from first principles based on Jeffrey conditionalization and show that the bagged posterior typically satisfies the overlap lower bound. Further, we prove a Bernstein--Von Mises theorem for the bagged posterior, establishing its asymptotic normal distribution. We demonstrate the benefits of BayesBag via simulation experiments and an application to crime rate prediction.
    GradTree: Learning Axis-Aligned Decision Trees with Gradient Descent. (arXiv:2305.03515v4 [cs.LG] UPDATED)
    Decision Trees (DTs) are commonly used for many machine learning tasks due to their high degree of interpretability. However, learning a DT from data is a difficult optimization problem, as it is non-convex and non-differentiable. Therefore, common approaches learn DTs using a greedy growth algorithm that minimizes the impurity locally at each internal node. Unfortunately, this greedy procedure can lead to inaccurate trees. In this paper, we present a novel approach for learning hard, axis-aligned DTs with gradient descent. The proposed method uses backpropagation with a straight-through operator on a dense DT representation, to jointly optimize all tree parameters. Our approach outperforms existing methods on binary classification benchmarks and achieves competitive results for multi-class tasks. The method is available under: https://github.com/s-marton/GradTree
    Quantum circuit synthesis with diffusion models. (arXiv:2311.02041v1 [quant-ph])
    Quantum computing has recently emerged as a transformative technology. Yet, its promised advantages rely on efficiently translating quantum operations into viable physical realizations. In this work, we use generative machine learning models, specifically denoising diffusion models (DMs), to facilitate this transformation. Leveraging text-conditioning, we steer the model to produce desired quantum operations within gate-based quantum circuits. Notably, DMs allow to sidestep during training the exponential overhead inherent in the classical simulation of quantum dynamics -- a consistent bottleneck in preceding ML techniques. We demonstrate the model's capabilities across two tasks: entanglement generation and unitary compilation. The model excels at generating new circuits and supports typical DM extensions such as masking and editing to, for instance, align the circuit generation to the constraints of the targeted quantum device. Given their flexibility and generalization abilities, we envision DMs as pivotal in quantum circuit synthesis, enhancing both practical applications but also insights into theoretical quantum computation.
    Towards Abstractive Timeline Summarisation using Preference-based Reinforcement Learning. (arXiv:2211.07596v2 [cs.LG] UPDATED)
    This paper introduces a novel pipeline for summarising timelines of events reported by multiple news sources. Transformer-based models for abstractive summarisation generate coherent and concise summaries of long documents but can fail to outperform established extractive methods on specialised tasks such as timeline summarisation (TLS). While extractive summaries are more faithful to their sources, they may be less readable and contain redundant or unnecessary information. This paper proposes a preference-based reinforcement learning (PBRL) method for adapting pretrained abstractive summarisers to TLS, which can overcome the drawbacks of extractive timeline summaries. We define a compound reward function that learns from keywords of interest and pairwise preference labels, which we use to fine-tune a pretrained abstractive summariser via offline reinforcement learning. We carry out both automated and human evaluation on three datasets, finding that our method outperforms a comparable extractive TLS method on two of the three benchmark datasets, and participants prefer our method's summaries to those of both the extractive TLS method and the pretrained abstractive model. The method does not require expensive reference summaries and needs only a small number of preferences to align the generated summaries with human preferences.
    Remember what you did so you know what to do next. (arXiv:2311.01468v1 [cs.CL])
    We explore using a moderately sized large language model (GPT-J 6B parameters) to create a plan for a simulated robot to achieve 30 classes of goals in ScienceWorld, a text game simulator for elementary science experiments. Previously published empirical work claimed that large language models (LLMs) are a poor fit (Wang et al., 2022) compared to reinforcement learning. Using the Markov assumption (a single previous step), the LLM outperforms the reinforcement learning-based approach by a factor of 1.4. When we fill the LLM's input buffer with as many prior steps as possible, improvement rises to 3.5x. Even when training on only 6.5% of the training data, we observe a 2.2x improvement over the reinforcement-learning-based approach. Our experiments show that performance varies widely across the 30 classes of actions, indicating that averaging over tasks can hide significant performance issues. In work contemporaneous with ours, Lin et al. (2023) demonstrated a two-part approach (SwiftSage) that uses a small LLM (T5-large) complemented by OpenAI's massive LLMs to achieve outstanding results in ScienceWorld. Our 6-B parameter, single-stage GPT-J matches the performance of SwiftSage's two-stage architecture when it incorporates GPT-3.5 turbo which has 29-times more parameters than GPT-J.
    RiskQ: Risk-sensitive Multi-Agent Reinforcement Learning Value Factorization. (arXiv:2311.01753v1 [cs.MA])
    Multi-agent systems are characterized by environmental uncertainty, varying policies of agents, and partial observability, which result in significant risks. In the context of Multi-Agent Reinforcement Learning (MARL), learning coordinated and decentralized policies that are sensitive to risk is challenging. To formulate the coordination requirements in risk-sensitive MARL, we introduce the Risk-sensitive Individual-Global-Max (RIGM) principle as a generalization of the Individual-Global-Max (IGM) and Distributional IGM (DIGM) principles. This principle requires that the collection of risk-sensitive action selections of each agent should be equivalent to the risk-sensitive action selection of the central policy. Current MARL value factorization methods do not satisfy the RIGM principle for common risk metrics such as the Value at Risk (VaR) metric or distorted risk measurements. Therefore, we propose RiskQ to address this limitation, which models the joint return distribution by modeling quantiles of it as weighted quantile mixtures of per-agent return distribution utilities. RiskQ satisfies the RIGM principle for the VaR and distorted risk metrics. We show that RiskQ can obtain promising performance through extensive experiments. The source code of RiskQ is available in https://github.com/xmu-rl-3dv/RiskQ.
    Active Learning-Based Species Range Estimation. (arXiv:2311.02061v1 [cs.LG])
    We propose a new active learning approach for efficiently estimating the geographic range of a species from a limited number of on the ground observations. We model the range of an unmapped species of interest as the weighted combination of estimated ranges obtained from a set of different species. We show that it is possible to generate this candidate set of ranges by using models that have been trained on large weakly supervised community collected observation data. From this, we develop a new active querying approach that sequentially selects geographic locations to visit that best reduce our uncertainty over an unmapped species' range. We conduct a detailed evaluation of our approach and compare it to existing active learning methods using an evaluation dataset containing expert-derived ranges for one thousand species. Our results demonstrate that our method outperforms alternative active learning methods and approaches the performance of end-to-end trained models, even when only using a fraction of the data. This highlights the utility of active learning via transfer learned spatial representations for species range estimation. It also emphasizes the value of leveraging emerging large-scale crowdsourced datasets, not only for modeling a species' range, but also for actively discovering them.
    Feature-Attending Recurrent Modules for Generalization in Reinforcement Learning. (arXiv:2112.08369v3 [cs.LG] UPDATED)
    Many important tasks are defined in terms of object. To generalize across these tasks, a reinforcement learning (RL) agent needs to exploit the structure that the objects induce. Prior work has either hard-coded object-centric features, used complex object-centric generative models, or updated state using local spatial features. However, these approaches have had limited success in enabling general RL agents. Motivated by this, we introduce "Feature-Attending Recurrent Modules" (FARM), an architecture for learning state representations that relies on simple, broadly applicable inductive biases for capturing spatial and temporal regularities. FARM learns a state representation that is distributed across multiple modules that each attend to spatiotemporal features with an expressive feature attention mechanism. We show that this improves an RL agent's ability to generalize across object-centric tasks. We study task suites in both 2D and 3D environments and find that FARM better generalizes compared to competing architectures that leverage attention or multiple modules.
    Detecting Out-of-Distribution Through the Lens of Neural Collapse. (arXiv:2311.01479v1 [cs.LG])
    Out-of-distribution (OOD) detection is essential for the safe deployment of AI. Particularly, OOD detectors should generalize effectively across diverse scenarios. To improve upon the generalizability of existing OOD detectors, we introduce a highly versatile OOD detector, called Neural Collapse inspired OOD detector (NC-OOD). We extend the prevalent observation that in-distribution (ID) features tend to form clusters, whereas OOD features are far away. Particularly, based on the recent observation, Neural Collapse, we further demonstrate that ID features tend to cluster in proximity to weight vectors. From our extended observation, we propose to detect OOD based on feature proximity to weight vectors. To further rule out OOD samples, we leverage the observation that OOD features tend to reside closer to the origin than ID features. Extensive experiments show that our approach enhances the generalizability of existing work and can consistently achieve state-of-the-art OOD detection performance across a wide range of OOD Benchmarks over different classification tasks, training losses, and model architectures.
    FedSN: A General Federated Learning Framework over LEO Satellite Networks. (arXiv:2311.01483v1 [cs.LG])
    Recently, a large number of Low Earth Orbit (LEO) satellites have been launched and deployed successfully in space by commercial companies, such as SpaceX. Due to multimodal sensors equipped by the LEO satellites, they serve not only for communication but also for various machine learning applications, such as space modulation recognition, remote sensing image classification, etc. However, the ground station (GS) may be incapable of downloading such a large volume of raw sensing data for centralized model training due to the limited contact time with LEO satellites (e.g. 5 minutes). Therefore, federated learning (FL) has emerged as the promising solution to address this problem via on-device training. Unfortunately, to enable FL on LEO satellites, we still face three critical challenges that are i) heterogeneous computing and memory capabilities, ii) limited uplink rate, and iii) model staleness. To this end, we propose FedSN as a general FL framework to tackle the above challenges, and fully explore data diversity on LEO satellites. Specifically, we first present a novel sub-structure scheme to enable heterogeneous local model training considering different computing, memory, and communication constraints on LEO satellites. Additionally, we propose a pseudo-synchronous model aggregation strategy to dynamically schedule model aggregation for compensating model staleness. To further demonstrate the effectiveness of the FedSN, we evaluate it using space modulation recognition and remote sensing image classification tasks by leveraging the data from real-world satellite networks. Extensive experimental results demonstrate that FedSN framework achieves higher accuracy, lower computing, and communication overhead than the state-of-the-art benchmarks and the effectiveness of each components in FedSN.
    An Ensemble Machine Learning Approach for Screening Covid-19 based on Urine Parameters. (arXiv:2311.01854v1 [eess.IV])
    The rapid spread of COVID-19 and the emergence of new variants underscore the importance of effective screening measures. Rapid diagnosis and subsequent quarantine of infected individuals can prevent further spread of the virus in society. While PCR tests are the gold standard for COVID-19 diagnosis, they are costly and time-consuming. In contrast, urine test strips are an inexpensive, non-invasive, and rapidly obtainable screening method that can provide important information about a patient's health status. In this study, we collected a new dataset and used the RGB (Red Green Blue) color space of urine test strips parameters to detect the health status of individuals. To improve the accuracy of our model, we converted the RGB space to 10 additional color spaces. After evaluating four different machine learning models, we proposed a new ensemble model based on a multi-layer perceptron neural network. Although the initial results were not strong, we were able to improve the model's screening performance for COVID-19 by removing uncertain regions of the model space. Ultimately, our model achieved a screening accuracy of 80% based on urine parameters. Our results suggest that urine test strips can be a useful tool for COVID-19 screening, particularly in resource-constrained settings where PCR testing may not be feasible. Further research is needed to validate our findings and explore the potential role of urine test strips in COVID-19 diagnosis and management.
    LOTUS: Continual Imitation Learning for Robot Manipulation Through Unsupervised Skill Discovery. (arXiv:2311.02058v1 [cs.RO])
    We introduce LOTUS, a continual imitation learning algorithm that empowers a physical robot to continuously and efficiently learn to solve new manipulation tasks throughout its lifespan. The core idea behind LOTUS is constructing an ever-growing skill library from a sequence of new tasks with a small number of human demonstrations. LOTUS starts with a continual skill discovery process using an open-vocabulary vision model, which extracts skills as recurring patterns presented in unsegmented demonstrations. Continual skill discovery updates existing skills to avoid catastrophic forgetting of previous tasks and adds new skills to solve novel tasks. LOTUS trains a meta-controller that flexibly composes various skills to tackle vision-based manipulation tasks in the lifelong learning process. Our comprehensive experiments show that LOTUS outperforms state-of-the-art baselines by over 11% in success rate, showing its superior knowledge transfer ability compared to prior methods. More results and videos can be found on the project website: https://ut-austin-rpl.github.io/Lotus/.
    Doubly Robust Self-Training. (arXiv:2306.00265v3 [cs.LG] UPDATED)
    Self-training is an important technique for solving semi-supervised learning problems. It leverages unlabeled data by generating pseudo-labels and combining them with a limited labeled dataset for training. The effectiveness of self-training heavily relies on the accuracy of these pseudo-labels. In this paper, we introduce doubly robust self-training, a novel semi-supervised algorithm that provably balances between two extremes. When the pseudo-labels are entirely incorrect, our method reduces to a training process solely using labeled data. Conversely, when the pseudo-labels are completely accurate, our method transforms into a training process utilizing all pseudo-labeled data and labeled data, thus increasing the effective sample size. Through empirical evaluations on both the ImageNet dataset for image classification and the nuScenes autonomous driving dataset for 3D object detection, we demonstrate the superiority of the doubly robust loss over the standard self-training baseline.
    Global Optimization: A Machine Learning Approach. (arXiv:2311.01742v1 [math.OC])
    Many approaches for addressing Global Optimization problems typically rely on relaxations of nonlinear constraints over specific mathematical primitives. This is restricting in applications with constraints that are black-box, implicit or consist of more general primitives. Trying to address such limitations, Bertsimas and Ozturk (2023) proposed OCTHaGOn as a way of solving black-box global optimization problems by approximating the nonlinear constraints using hyperplane-based Decision-Trees and then using those trees to construct a unified mixed integer optimization (MIO) approximation of the original problem. We provide extensions to this approach, by (i) approximating the original problem using other MIO-representable ML models besides Decision Trees, such as Gradient Boosted Trees, Multi Layer Perceptrons and Suport Vector Machines, (ii) proposing adaptive sampling procedures for more accurate machine learning-based constraint approximations, (iii) utilizing robust optimization to account for the uncertainty of the sample-dependent training of the ML models, and (iv) leveraging a family of relaxations to address the infeasibilities of the final MIO approximation. We then test the enhanced framework in 81 Global Optimization instances. We show improvements in solution feasibility and optimality in the majority of instances. We also compare against BARON, showing improved optimality gaps or solution times in 11 instances.
    Spectral Clustering of Attributed Multi-relational Graphs. (arXiv:2311.01840v1 [cs.LG])
    Graph clustering aims at discovering a natural grouping of the nodes such that similar nodes are assigned to a common cluster. Many different algorithms have been proposed in the literature: for simple graphs, for graphs with attributes associated to nodes, and for graphs where edges represent different types of relations among nodes. However, complex data in many domains can be represented as both attributed and multi-relational networks. In this paper, we propose SpectralMix, a joint dimensionality reduction technique for multi-relational graphs with categorical node attributes. SpectralMix integrates all information available from the attributes, the different types of relations, and the graph structure to enable a sound interpretation of the clustering results. Moreover, it generalizes existing techniques: it reduces to spectral embedding and clustering when only applied to a single graph and to homogeneity analysis when applied to categorical data. Experiments conducted on several real-world datasets enable us to detect dependencies between graph structure and categorical attributes, moreover, they exhibit the superiority of SpectralMix over existing methods.
    TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices. (arXiv:2311.01759v1 [cs.LG])
    Developing deep learning models on tiny devices (e.g. Microcontroller units, MCUs) has attracted much attention in various embedded IoT applications. However, it is challenging to efficiently design and deploy recent advanced models (e.g. transformers) on tiny devices due to their severe hardware resource constraints. In this work, we propose TinyFormer, a framework specifically designed to develop and deploy resource-efficient transformers on MCUs. TinyFormer mainly consists of SuperNAS, SparseNAS and SparseEngine. Separately, SuperNAS aims to search for an appropriate supernet from a vast search space. SparseNAS evaluates the best sparse single-path model including transformer architecture from the identified supernet. Finally, SparseEngine efficiently deploys the searched sparse models onto MCUs. To the best of our knowledge, SparseEngine is the first deployment framework capable of performing inference of sparse models with transformer on MCUs. Evaluation results on the CIFAR-10 dataset demonstrate that TinyFormer can develop efficient transformers with an accuracy of $96.1\%$ while adhering to hardware constraints of $1$MB storage and $320$KB memory. Additionally, TinyFormer achieves significant speedups in sparse inference, up to $12.2\times$, when compared to the CMSIS-NN library. TinyFormer is believed to bring powerful transformers into TinyML scenarios and greatly expand the scope of deep learning applications.
    Minimax Quasi-Bayesian estimation in sparse canonical correlation analysis via a Rayleigh quotient function. (arXiv:2010.08627v3 [stat.ML] UPDATED)
    Canonical correlation analysis (CCA) is a popular statistical technique for exploring relationships between datasets. In recent years, the estimation of sparse canonical vectors has emerged as an important but challenging variant of the CCA problem, with widespread applications. Unfortunately, existing rate-optimal estimators for sparse canonical vectors have high computational cost. We propose a quasi-Bayesian estimation procedure that not only achieves the minimax estimation rate, but also is easy to compute by Markov Chain Monte Carlo (MCMC). The method builds on Tan et al. (2018) and uses a re-scaled Rayleigh quotient function as the quasi-log-likelihood. However, unlike Tan et al. (2018), we adopt a Bayesian framework that combines this quasi-log-likelihood with a spike-and-slab prior to regularize the inference and promote sparsity. We investigate the empirical behavior of the proposed method on both continuous and truncated data, and we demonstrate that it outperforms several state-of-the-art methods. As an application, we use the proposed methodology to maximally correlate clinical variables and proteomic data for better understanding the Covid-19 disease.
    Why think step by step? Reasoning emerges from the locality of experience. (arXiv:2304.03843v3 [cs.AI] UPDATED)
    Humans have a powerful and mysterious capacity to reason. Working through a set of mental steps enables us to make inferences we would not be capable of making directly even though we get no additional data from the world. Similarly, when large language models generate intermediate steps (a chain of thought) before answering a question, they often produce better answers than they would directly. We investigate why and how chain-of-thought reasoning is useful in language models, testing the hypothesis that reasoning is effective when training data consists of overlapping local clusters of variables that influence each other strongly. These training conditions enable the chaining of accurate local inferences to estimate relationships between variables that were not seen together in training. We prove that there will exist a "reasoning gap", where reasoning through intermediate variables reduces bias, for the simple case of an autoregressive density estimator trained on local samples from a chain-structured probabilistic model. We then test our hypothesis experimentally in more complex models, training an autoregressive language model on samples from Bayes nets but only including a subset of variables in each sample. We test language models' ability to match conditional probabilities with and without intermediate reasoning steps, finding that intermediate steps are only helpful when the training data is locally structured with respect to dependencies between variables. The combination of locally structured observations and reasoning is much more data-efficient than training on all variables. Our results illustrate how the effectiveness of reasoning step by step is rooted in the local statistical structure of the training data.
    Enhancing Functional Data Analysis with Sequential Neural Networks: Advantages and Comparative Study. (arXiv:2311.01875v1 [cs.LG])
    Functional Data Analysis (FDA) is a statistical domain developed to handle functional data characterized by high dimensionality and complex data structures. Sequential Neural Networks (SNNs) are specialized neural networks capable of processing sequence data, a fundamental aspect of functional data. Despite their great flexibility in modeling functional data, SNNs have been inadequately employed in the FDA community. One notable advantage of SNNs is the ease of implementation, making them accessible to a broad audience beyond academia. Conversely, FDA-based methodologies present challenges, particularly for practitioners outside the field, due to their intricate complexity. In light of this, we propose utilizing SNNs in FDA applications and demonstrate their effectiveness through comparative analyses against popular FDA regression models based on numerical experiments and real-world data analysis. SNN architectures allow us to surpass the limitations of traditional FDA methods, offering scalability, flexibility, and improved analytical performance. Our findings highlight the potential of SNN-based methodologies as powerful tools for data applications involving functional data.
    High Precision Causal Model Evaluation with Conditional Randomization. (arXiv:2311.01902v1 [cs.LG])
    The gold standard for causal model evaluation involves comparing model predictions with true effects estimated from randomized controlled trials (RCT). However, RCTs are not always feasible or ethical to perform. In contrast, conditionally randomized experiments based on inverse probability weighting (IPW) offer a more realistic approach but may suffer from high estimation variance. To tackle this challenge and enhance causal model evaluation in real-world conditional randomization settings, we introduce a novel low-variance estimator for causal error, dubbed as the pairs estimator. By applying the same IPW estimator to both the model and true experimental effects, our estimator effectively cancels out the variance due to IPW and achieves a smaller asymptotic variance. Empirical studies demonstrate the improved of our estimator, highlighting its potential on achieving near-RCT performance. Our method offers a simple yet powerful solution to evaluate causal inference models in conditional randomization settings without complicated modification of the IPW estimator itself, paving the way for more robust and reliable model assessments.
    Conditions on Preference Relations that Guarantee the Existence of Optimal Policies. (arXiv:2311.01990v1 [cs.LG])
    Learning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactive learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. Using the von Neumann-Morgenstern Expected Utility Theorem, we show that the Direct Preference Process generalizes the standard reinforcement learning problem. Our findings narrow the gap between the empirical success and theoretical understanding of LfPF algorithms and provide future practitioners with the tools necessary for a more principled design of LfPF agents.
    CheX-Nomaly: Segmenting Lung Abnormalities from Chest Radiographs using Machine Learning. (arXiv:2311.01777v1 [eess.IV])
    The global challenge in chest radiograph X-ray (CXR) abnormalities often being misdiagnosed is primarily associated with perceptual errors, where healthcare providers struggle to accurately identify the location of abnormalities, rather than misclassification errors. We currently address this problem through disease-specific segmentation models. Unfortunately, these models cannot be released in the field due to their lack of generalizability across all thoracic diseases. A binary model tends to perform poorly when it encounters a disease that isn't represented in the dataset. We present CheX-nomaly: a binary localization U-net model that leverages transfer learning techniques with the incorporation of an innovative contrastive learning approach. Trained on the VinDr-CXR dataset, which encompasses 14 distinct diseases in addition to 'no finding' cases, my model achieves generalizability across these 14 diseases and others it has not seen before. We show that we can significantly improve the generalizability of an abnormality localization model by incorporating a contrastive learning method and dissociating the bounding boxes with its disease class. We also introduce a new loss technique to apply to enhance the U-nets performance on bounding box segmentation. By introducing CheX-nomaly, we offer a promising solution to enhance the precision of chest disease diagnosis, with a specific focus on reducing the significant number of perceptual errors in healthcare.
    ForecastPFN: Synthetically-Trained Zero-Shot Forecasting. (arXiv:2311.01933v1 [cs.LG])
    The vast majority of time-series forecasting approaches require a substantial training dataset. However, many real-life forecasting applications have very little initial observations, sometimes just 40 or fewer. Thus, the applicability of most forecasting methods is restricted in data-sparse commercial applications. While there is recent work in the setting of very limited initial data (so-called `zero-shot' forecasting), its performance is inconsistent depending on the data used for pretraining. In this work, we take a different approach and devise ForecastPFN, the first zero-shot forecasting model trained purely on a novel synthetic data distribution. ForecastPFN is a prior-data fitted network, trained to approximate Bayesian inference, which can make predictions on a new time series dataset in a single forward pass. Through extensive experiments, we show that zero-shot predictions made by ForecastPFN are more accurate and faster compared to state-of-the-art forecasting methods, even when the other methods are allowed to train on hundreds of additional in-distribution data points.
    Obtaining Explainable Classification Models using Distributionally Robust Optimization. (arXiv:2311.01994v1 [stat.ML])
    Model explainability is crucial for human users to be able to interpret how a proposed classifier assigns labels to data based on its feature values. We study generalized linear models constructed using sets of feature value rules, which can capture nonlinear dependencies and interactions. An inherent trade-off exists between rule set sparsity and its prediction accuracy. It is computationally expensive to find the right choice of sparsity -- e.g., via cross-validation -- with existing methods. We propose a new formulation to learn an ensemble of rule sets that simultaneously addresses these competing factors. Good generalization is ensured while keeping computational costs low by utilizing distributionally robust optimization. The formulation utilizes column generation to efficiently search the space of rule sets and constructs a sparse ensemble of rule sets, in contrast with techniques like random forests or boosting and their variants. We present theoretical results that motivate and justify the use of our distributionally robust formulation. Extensive numerical experiments establish that our method improves over competing methods -- on a large set of publicly available binary classification problem instances -- with respect to one or more of the following metrics: generalization quality, computational cost, and explainability.
    Recurrent Neural-Linear Posterior Sampling for Nonstationary Contextual Bandits. (arXiv:2007.04750v2 [cs.LG] UPDATED)
    An agent in a nonstationary contextual bandit problem should balance between exploration and the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive alternative to transform a nonstationary problem into a stationary problem that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. In order to address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach relies on a combination of features extracted by recurrent neural networks with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and noncontextual nonstationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional nonstationary bandit algorithms. Although it is very difficult to provide theoretical performance guarantees for our new approach, we also prove a novel regret bound for linear posterior sampling with measurement error that may serve as a foundation for future theoretical work.
    Characterizing Graph Datasets for Node Classification: Homophily-Heterophily Dichotomy and Beyond. (arXiv:2209.06177v3 [cs.SI] UPDATED)
    Homophily is a graph property describing the tendency of edges to connect similar nodes; the opposite is called heterophily. It is often believed that heterophilous graphs are challenging for standard message-passing graph neural networks (GNNs), and much effort has been put into developing efficient methods for this setting. However, there is no universally agreed-upon measure of homophily in the literature. In this work, we show that commonly used homophily measures have critical drawbacks preventing the comparison of homophily levels across different datasets. For this, we formalize desirable properties for a proper homophily measure and verify which measures satisfy which properties. In particular, we show that a measure that we call adjusted homophily satisfies more desirable properties than other popular homophily measures while being rarely used in graph machine learning literature. Then, we go beyond the homophily-heterophily dichotomy and propose a new characteristic that allows one to further distinguish different sorts of heterophily. The proposed label informativeness (LI) characterizes how much information a neighbor's label provides about a node's label. We prove that this measure satisfies important desirable properties. We also observe empirically that LI better agrees with GNN performance compared to homophily measures, which confirms that it is a useful characteristic of the graph structure.
    Mix-ME: Quality-Diversity for Multi-Agent Learning. (arXiv:2311.01829v1 [cs.LG])
    In many real-world systems, such as adaptive robotics, achieving a single, optimised solution may be insufficient. Instead, a diverse set of high-performing solutions is often required to adapt to varying contexts and requirements. This is the realm of Quality-Diversity (QD), which aims to discover a collection of high-performing solutions, each with their own unique characteristics. QD methods have recently seen success in many domains, including robotics, where they have been used to discover damage-adaptive locomotion controllers. However, most existing work has focused on single-agent settings, despite many tasks of interest being multi-agent. To this end, we introduce Mix-ME, a novel multi-agent variant of the popular MAP-Elites algorithm that forms new solutions using a crossover-like operator by mixing together agents from different teams. We evaluate the proposed methods on a variety of partially observable continuous control tasks. Our evaluation shows that these multi-agent variants obtained by Mix-ME not only compete with single-agent baselines but also often outperform them in multi-agent settings under partial observability.
    High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise. (arXiv:2311.02000v1 [math.OC])
    In this paper, we study the convergence of the Adaptive Moment Estimation (Adam) algorithm under unconstrained non-convex smooth stochastic optimizations. Despite the widespread usage in machine learning areas, its theoretical properties remain limited. Prior researches primarily investigated Adam's convergence from an expectation view, often necessitating strong assumptions like uniformly stochastic bounded gradients or problem-dependent knowledge in prior. As a result, the applicability of these findings in practical real-world scenarios has been constrained. To overcome these limitations, we provide a deep analysis and show that Adam could converge to the stationary point in high probability with a rate of $\mathcal{O}\left({\rm poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" variance noise, not requiring any bounded gradient assumption and any problem-dependent knowledge in prior to tune hyper-parameters. Additionally, it is revealed that Adam confines its gradients' magnitudes within an order of $\mathcal{O}\left({\rm poly}(\log T)\right)$. Finally, we also investigate a simplified version of Adam without one of the corrective terms and obtain a convergence rate that is adaptive to the noise level.
    A Variational Perspective on High-Resolution ODEs. (arXiv:2311.02002v1 [math.OC])
    We consider unconstrained minimization of smooth convex functions. We propose a novel variational perspective using forced Euler-Lagrange equation that allows for studying high-resolution ODEs. Through this, we obtain a faster convergence rate for gradient norm minimization using Nesterov's accelerated gradient method. Additionally, we show that Nesterov's method can be interpreted as a rate-matching discretization of an appropriately chosen high-resolution ODE. Finally, using the results from the new variational perspective, we propose a stochastic method for noisy gradients. Several numerical experiments compare and illustrate our stochastic algorithm with state of the art methods.
    Heterogeneous federated collaborative filtering using FAIR: Federated Averaging in Random Subspaces. (arXiv:2311.01722v1 [cs.LG])
    Recommendation systems (RS) for items (e.g., movies, books) and ads are widely used to tailor content to users on various internet platforms. Traditionally, recommendation models are trained on a central server. However, due to rising concerns for data privacy and regulations like the GDPR, federated learning is an increasingly popular paradigm in which data never leaves the client device. Applying federated learning to recommendation models is non-trivial due to large embedding tables, which often exceed the memory constraints of most user devices. To include data from all devices in federated learning, we must enable collective training of embedding tables on devices with heterogeneous memory capacities. Current solutions to heterogeneous federated learning can only accommodate a small range of capacities and thus limit the number of devices that can participate in training. We present Federated Averaging in Random subspaces (FAIR), which allows arbitrary compression of embedding tables based on device capacity and ensures the participation of all devices in training. FAIR uses what we call consistent and collapsible subspaces defined by hashing-based random projections to jointly train large embedding tables while using varying amounts of compression on user devices. We evaluate FAIR on Neural Collaborative Filtering tasks with multiple datasets and verify that FAIR can gather and share information from a wide range of devices with varying capacities, allowing for seamless collaboration. We prove the convergence of FAIR in the homogeneous setting with non-i.i.d data distribution. Our code is open source at {https://github.com/apd10/FLCF}
    SEGA: Instructing Text-to-Image Models using Semantic Guidance. (arXiv:2301.12247v2 [cs.CV] UPDATED)
    Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.
    OpenAGI: When LLM Meets Domain Experts. (arXiv:2304.04370v6 [cs.AI] UPDATED)
    Human Intelligence (HI) excels at combining basic skills to solve complex tasks. This capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI Agents, enabling them to harness expert models for complex task-solving towards Artificial General Intelligence (AGI). Large Language Models (LLMs) show promising learning and reasoning abilities, and can effectively use external models, tools, plugins, or APIs to tackle complex problems. In this work, we introduce OpenAGI, an open-source AGI research and development platform designed for solving multi-step, real-world tasks. Specifically, OpenAGI uses a dual strategy, integrating standard benchmark tasks for benchmarking and evaluation, and open-ended tasks including more expandable models, tools, plugins, or APIs for creative problem-solving. Tasks are presented as natural language queries to the LLM, which then selects and executes appropriate models. We also propose a Reinforcement Learning from Task Feedback (RLTF) mechanism that uses task results to improve the LLM's task-solving ability, which creates a self-improving AI feedback loop. While we acknowledge that AGI is a broad and multifaceted research challenge with no singularly defined solution path, the integration of LLMs with domain-specific expert models, inspired by mirroring the blend of general and specialized intelligence in humans, offers a promising approach towards AGI. We are open-sourcing the OpenAGI project's code, dataset, benchmarks, evaluation methods, and the UI demo to foster community involvement in AGI advancement: https://github.com/agiresearch/OpenAGI.
    A minimax optimal control approach for robust neural ODEs. (arXiv:2310.17584v2 [math.OC] UPDATED)
    In this paper, we address the adversarial training of neural ODEs from a robust control perspective. This is an alternative to the classical training via empirical risk minimization, and it is widely used to enforce reliable outcomes for input perturbations. Neural ODEs allow the interpretation of deep neural networks as discretizations of control systems, unlocking powerful tools from control theory for the development and the understanding of machine learning. In this specific case, we formulate the adversarial training with perturbed data as a minimax optimal control problem, for which we derive first order optimality conditions in the form of Pontryagin's Maximum Principle. We provide a novel interpretation of robust training leading to an alternative weighted technique, which we test on a low-dimensional classification task.
    Applications of the Theory of Aggregated Markov Processes in Stochastic Learning Theory. (arXiv:2311.01476v1 [stat.ML])
    A stochastic process that arises by composing a function with a Markov process is called an aggregated Markov process (AMP). The purpose of composing a Markov process with a function can be a reduction of dimensions, e.g., a projection onto certain coordinates. The theory around AMP has been extensively studied e.g. by Dynkin, Cameron, Rogers and Pitman, and Kelly, all of whom provided sufficient conditions for an AMP to remain Markov. In another direction, Larget provided a canonical representation for AMP, which can be used to verify the equivalence of two AMPs. The purpose of this paper is to describe how the theory of AMP can be applied to stochastic learning theory as they learn a particular task.
    Adversarial Attacks on Cooperative Multi-agent Bandits. (arXiv:2311.01698v1 [cs.LG])
    Cooperative multi-agent multi-armed bandits (CMA2B) consider the collaborative efforts of multiple agents in a shared multi-armed bandit game. We study latent vulnerabilities exposed by this collaboration and consider adversarial attacks on a few agents with the goal of influencing the decisions of the rest. More specifically, we study adversarial attacks on CMA2B in both homogeneous settings, where agents operate with the same arm set, and heterogeneous settings, where agents have distinct arm sets. In the homogeneous setting, we propose attack strategies that, by targeting just one agent, convince all agents to select a particular target arm $T-o(T)$ times while incurring $o(T)$ attack costs in $T$ rounds. In the heterogeneous setting, we prove that a target arm attack requires linear attack costs and propose attack strategies that can force a maximum number of agents to suffer linear regrets while incurring sublinear costs and only manipulating the observations of a few target agents. Numerical experiments validate the effectiveness of our proposed attack strategies.
    Energy Efficiency Optimization for Subterranean LoRaWAN Using A Reinforcement Learning Approach: A Direct-to-Satellite Scenario. (arXiv:2311.01743v1 [cs.IT])
    The integration of subterranean LoRaWAN and non-terrestrial networks (NTN) delivers substantial economic and societal benefits in remote agriculture and disaster rescue operations. The LoRa modulation leverages quasi-orthogonal spreading factors (SFs) to optimize data rates, airtime, coverage and energy consumption. However, it is still challenging to effectively assign SFs to end devices for minimizing co-SF interference in massive subterranean LoRaWAN NTN. To address this, we investigate a reinforcement learning (RL)-based SFs allocation scheme to optimize the system's energy efficiency (EE). To efficiently capture the device-to-environment interactions in dense networks, we proposed an SFs allocation technique using the multi-agent dueling double deep Q-network (MAD3QN) and the multi-agent advantage actor-critic (MAA2C) algorithms based on an analytical reward mechanism. Our proposed RL-based SFs allocation approach evinces better performance compared to four benchmarks in the extreme underground direct-to-satellite scenario. Remarkably, MAD3QN shows promising potentials in surpassing MAA2C in terms of convergence rate and EE.
    Simplifying Transformer Blocks. (arXiv:2311.01906v1 [cs.LG])
    A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections & normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable. In this work, we ask to what extent the standard transformer block can be simplified? Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput, and using 15% fewer parameters.
    Faithful and Robust Local Interpretability for Textual Predictions. (arXiv:2311.01605v1 [cs.CL])
    Interpretability is essential for machine learning models to be trusted and deployed in critical domains. However, existing methods for interpreting text models are often complex, lack solid mathematical foundations, and their performance is not guaranteed. In this paper, we propose FRED (Faithful and Robust Explainer for textual Documents), a novel method for interpreting predictions over text. FRED identifies key words in a document that significantly impact the prediction when removed. We establish the reliability of FRED through formal definitions and theoretical analyses on interpretable classifiers. Additionally, our empirical evaluation against state-of-the-art methods demonstrates the effectiveness of FRED in providing insights into text models.
    Physics-Informed Generator-Encoder Adversarial Networks with Latent Space Matching for Stochastic Differential Equations. (arXiv:2311.01708v1 [cs.LG])
    We propose a new class of physics-informed neural networks, called Physics-Informed Generator-Encoder Adversarial Networks, to effectively address the challenges posed by forward, inverse, and mixed problems in stochastic differential equations. In these scenarios, while the governing equations are known, the available data consist of only a limited set of snapshots for system parameters. Our model consists of two key components: the generator and the encoder, both updated alternately by gradient descent. In contrast to previous approaches of directly matching the approximated solutions with real snapshots, we employ an indirect matching that operates within the lower-dimensional latent feature space. This method circumvents challenges associated with high-dimensional inputs and complex data distributions, while yielding more accurate solutions compared to existing neural network solvers. In addition, the approach also mitigates the training instability issues encountered in previous adversarial frameworks in an efficient manner. Numerical results provide compelling evidence of the effectiveness of the proposed method in solving different types of stochastic differential equations.
    Patch-Based Deep Unsupervised Image Segmentation using Graph Cuts. (arXiv:2311.01475v1 [cs.CV])
    Unsupervised image segmentation aims at grouping different semantic patterns in an image without the use of human annotation. Similarly, image clustering searches for groupings of images based on their semantic content without supervision. Classically, both problems have captivated researchers as they drew from sound mathematical concepts to produce concrete applications. With the emergence of deep learning, the scientific community turned its attention to complex neural network-based solvers that achieved impressive results in those domains but rarely leveraged the advances made by classical methods. In this work, we propose a patch-based unsupervised image segmentation strategy that bridges advances in unsupervised feature extraction from deep clustering methods with the algorithmic help of classical graph-based methods. We show that a simple convolutional neural network, trained to classify image patches and iteratively regularized using graph cuts, naturally leads to a state-of-the-art fully-convolutional unsupervised pixel-level segmenter. Furthermore, we demonstrate that this is the ideal setting for leveraging the patch-level pairwise features generated by vision transformer models. Our results on real image data demonstrate the effectiveness of our proposed methodology.
    Invariant Causal Imitation Learning for Generalizable Policies. (arXiv:2311.01489v1 [stat.ML])
    Consider learning an imitation policy on the basis of demonstrated behavior from multiple environments, with an eye towards deployment in an unseen environment. Since the observable features from each setting may be different, directly learning individual policies as mappings from features to actions is prone to spurious correlations -- and may not generalize well. However, the expert's policy is often a function of a shared latent structure underlying those observable features that is invariant across settings. By leveraging data from multiple environments, we propose Invariant Causal Imitation Learning (ICIL), a novel technique in which we learn a feature representation that is invariant across domains, on the basis of which we learn an imitation policy that matches expert behavior. To cope with transition dynamics mismatch, ICIL learns a shared representation of causal features (for all training environments), that is disentangled from the specific representations of noise variables (for each of those environments). Moreover, to ensure that the learned policy matches the observation distribution of the expert's policy, ICIL estimates the energy of the expert's observations and uses a regularization term that minimizes the imitator policy's next state energy. Experimentally, we compare our methods against several benchmarks in control and healthcare tasks and show its effectiveness in learning imitation policies capable of generalizing to unseen environments.
    Detecting Spurious Correlations via Robust Visual Concepts in Real and AI-Generated Image Classification. (arXiv:2311.01655v1 [cs.LG])
    Often machine learning models tend to automatically learn associations present in the training data without questioning their validity or appropriateness. This undesirable property is the root cause of the manifestation of spurious correlations, which render models unreliable and prone to failure in the presence of distribution shifts. Research shows that most methods attempting to remedy spurious correlations are only effective for a model's known spurious associations. Current spurious correlation detection algorithms either rely on extensive human annotations or are too restrictive in their formulation. Moreover, they rely on strict definitions of visual artifacts that may not apply to data produced by generative models, as they are known to hallucinate contents that do not conform to standard specifications. In this work, we introduce a general-purpose method that efficiently detects potential spurious correlations, and requires significantly less human interference in comparison to the prior art. Additionally, the proposed method provides intuitive explanations while eliminating the need for pixel-level annotations. We demonstrate the proposed method's tolerance to the peculiarity of AI-generated images, which is a considerably challenging task, one where most of the existing methods fall short. Consequently, our method is also suitable for detecting spurious correlations that may propagate to downstream applications originating from generative models.
    Creating Trustworthy LLMs: Dealing with Hallucinations in Healthcare AI. (arXiv:2311.01463v1 [cs.CL])
    Large language models have proliferated across multiple domains in as short period of time. There is however hesitation in the medical and healthcare domain towards their adoption because of issues like factuality, coherence, and hallucinations. Give the high stakes nature of healthcare, many researchers have even cautioned against its usage until these issues are resolved. The key to the implementation and deployment of LLMs in healthcare is to make these models trustworthy, transparent (as much possible) and explainable. In this paper we describe the key elements in creating reliable, trustworthy, and unbiased models as a necessary condition for their adoption in healthcare. Specifically we focus on the quantification, validation, and mitigation of hallucinations in the context in healthcare. Lastly, we discuss how the future of LLMs in healthcare may look like.
    Leveraging Language Models to Detect Greenwashing. (arXiv:2311.01469v1 [cs.CL])
    In recent years, climate change repercussions have increasingly captured public interest. Consequently, corporations are emphasizing their environmental efforts in sustainability reports to bolster their public image. Yet, the absence of stringent regulations in review of such reports allows potential greenwashing. In this study, we introduce a novel methodology to train a language model on generated labels for greenwashing risk. Our primary contributions encompass: developing a mathematical formulation to quantify greenwashing risk, a fine-tuned ClimateBERT model for this problem, and a comparative analysis of results. On a test set comprising of sustainability reports, our best model achieved an average accuracy score of 86.34% and F1 score of 0.67, demonstrating that our methods show a promising direction of exploration for this task.
    Look-Ahead Selective Plasticity for Continual Learning of Visual Tasks. (arXiv:2311.01617v1 [cs.CV])
    Contrastive representation learning has emerged as a promising technique for continual learning as it can learn representations that are robust to catastrophic forgetting and generalize well to unseen future tasks. Previous work in continual learning has addressed forgetting by using previous task data and trained models. Inspired by event models created and updated in the brain, we propose a new mechanism that takes place during task boundaries, i.e., when one task finishes and another starts. By observing the redundancy-inducing ability of contrastive loss on the output of a neural network, our method leverages the first few samples of the new task to identify and retain parameters contributing most to the transfer ability of the neural network, freeing up the remaining parts of the network to learn new features. We evaluate the proposed methods on benchmark computer vision datasets including CIFAR10 and TinyImagenet and demonstrate state-of-the-art performance in the task-incremental, class-incremental, and domain-incremental continual learning scenarios.
    Epidemic Decision-making System Based Federated Reinforcement Learning. (arXiv:2311.01749v1 [cs.LG])
    Epidemic decision-making can effectively help the government to comprehensively consider public security and economic development to respond to public health and safety emergencies. Epidemic decision-making can effectively help the government to comprehensively consider public security and economic development to respond to public health and safety emergencies. Some studies have shown that intensive learning can effectively help the government to make epidemic decision, thus achieving the balance between health security and economic development. Some studies have shown that intensive learning can effectively help the government to make epidemic decision, thus achieving the balance between health security and economic development. However, epidemic data often has the characteristics of limited samples and high privacy. However, epidemic data often has the characteristics of limited samples and high privacy. This model can combine the epidemic situation data of various provinces for cooperative training to use as an enhanced learning model for epidemic situation decision, while protecting the privacy of data. The experiment shows that the enhanced federated learning can obtain more optimized performance and return than the enhanced learning, and the enhanced federated learning can also accelerate the training convergence speed of the training model. accelerate the training convergence speed of the client. At the same time, through the experimental comparison, A2C is the most suitable reinforcement learning model for the epidemic situation decision-making. learning model for the epidemic situation decision-making scenario, followed by the PPO model, and the performance of DDPG is unsatisfactory.
    Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization. (arXiv:2311.01544v1 [cs.CL])
    Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. Their ever-increasing size, however, raised concerns about their effective deployment and the need for LLM compressions. This study introduces the Divergent Token metrics (DTMs), a novel approach for assessing compressed LLMs, addressing the limitations of traditional measures like perplexity that fail to accurately reflect text generation quality. DTMs focus on token divergence, providing deeper insights into the subtleties of model compression. Our results indicate that significant levels of precision and sparsity can be achieved without compromising text generation quality. Moreover, DTMs offers a more precise evaluation of each component's impact individually. Utilizing the First Divergent Token metric (FDTM) in model sparsification reveals that nearly 20% of all components can be pruned over 90%. In terms of quantization, the FDTM suggests that over 80% of parameters can be straightforwardly transformed to int8 without special outlier management.
    Adversarial Examples in the Physical World: A Survey. (arXiv:2311.01473v1 [cs.CV])
    Deep neural networks (DNNs) have demonstrated high vulnerability to adversarial examples. Besides the attacks in the digital world, the practical implications of adversarial examples in the physical world present significant challenges and safety concerns. However, current research on physical adversarial examples (PAEs) lacks a comprehensive understanding of their unique characteristics, leading to limited significance and understanding. In this paper, we address this gap by thoroughly examining the characteristics of PAEs within a practical workflow encompassing training, manufacturing, and re-sampling processes. By analyzing the links between physical adversarial attacks, we identify manufacturing and re-sampling as the primary sources of distinct attributes and particularities in PAEs. Leveraging this knowledge, we develop a comprehensive analysis and classification framework for PAEs based on their specific characteristics, covering over 100 studies on physical-world adversarial examples. Furthermore, we investigate defense strategies against PAEs and identify open challenges and opportunities for future research. We aim to provide a fresh, thorough, and systematic understanding of PAEs, thereby promoting the development of robust adversarial learning and its application in open-world scenarios.
    E(2) Equivariant Neural Networks for Robust Galaxy Morphology Classification. (arXiv:2311.01500v1 [astro-ph.GA])
    We propose the use of group convolutional neural network architectures (GCNNs) equivariant to the 2D Euclidean group, $E(2)$, for the task of galaxy morphology classification by utilizing symmetries of the data present in galaxy images as an inductive bias in the architecture. We conduct robustness studies by introducing artificial perturbations via Poisson noise insertion and one-pixel adversarial attacks to simulate the effects of limited observational capabilities. We train, validate, and test GCNNs equivariant to discrete subgroups of $E(2)$ - the cyclic and dihedral groups of order $N$ - on the Galaxy10 DECals dataset and find that GCNNs achieve higher classification accuracy and are consistently more robust than their non-equivariant counterparts, with an architecture equivariant to the group $D_{16}$ achieving a $95.52 \pm 0.18\%$ test-set accuracy. We also find that the model loses $<6\%$ accuracy on a $50\%$-noise dataset and all GCNNs are less susceptible to one-pixel perturbations than an identically constructed CNN. Our code is publicly available at https://github.com/snehjp2/GCNNMorphology.
    SortNet: Learning To Rank By a Neural-Based Sorting Algorithm. (arXiv:2311.01864v1 [cs.LG])
    The problem of relevance ranking consists of sorting a set of objects with respect to a given criterion. Since users may prefer different relevance criteria, the ranking algorithms should be adaptable to the user needs. Two main approaches exist in literature for the task of learning to rank: 1) a score function, learned by examples, which evaluates the properties of each object yielding an absolute relevance value that can be used to order the objects or 2) a pairwise approach, where a "preference function" is learned using pairs of objects to define which one has to be ranked first. In this paper, we present SortNet, an adaptive ranking algorithm which orders objects using a neural network as a comparator. The neural network training set provides examples of the desired ordering between pairs of items and it is constructed by an iterative procedure which, at each iteration, adds the most informative training examples. Moreover, the comparator adopts a connectionist architecture that is particularly suited for implementing a preference function. We also prove that such an architecture has the universal approximation property and can implement a wide class of functions. Finally, the proposed algorithm is evaluated on the LETOR dataset showing promising performances in comparison with other state of the art algorithms.
    Flexible Error Mitigation of Quantum Processes with Data Augmentation Empowered Neural Model. (arXiv:2311.01727v1 [quant-ph])
    Neural networks have shown their effectiveness in various tasks in the realm of quantum computing. However, their application in quantum error mitigation, a crucial step towards realizing practical quantum advancements, has been restricted by reliance on noise-free statistics. To tackle this critical challenge, we propose a data augmentation empowered neural model for error mitigation (DAEM). Our model does not require any prior knowledge about the specific noise type and measurement settings and can estimate noise-free statistics solely from the noisy measurement results of the target quantum process, rendering it highly suitable for practical implementation. In numerical experiments, we show the model's superior performance in mitigating various types of noise, including Markovian noise and Non-Markovian noise, compared with previous error mitigation methods. We further demonstrate its versatility by employing the model to mitigate errors in diverse types of quantum processes, including those involving large-scale quantum systems and continuous-variable quantum states. This powerful data augmentation-empowered neural model for error mitigation establishes a solid foundation for realizing more reliable and robust quantum technologies in practical applications.
    Calibrate and Boost Logical Expressiveness of GNN Over Multi-Relational and Temporal Graphs. (arXiv:2311.01647v1 [cs.LG])
    As a powerful framework for graph representation learning, Graph Neural Networks (GNNs) have garnered significant attention in recent years. However, to the best of our knowledge, there has been no formal analysis of the logical expressiveness of GNNs as Boolean node classifiers over multi-relational graphs, where each edge carries a specific relation type. In this paper, we investigate $\mathcal{FOC}_2$, a fragment of first-order logic with two variables and counting quantifiers. On the negative side, we demonstrate that the R$^2$-GNN architecture, which extends the local message passing GNN by incorporating global readout, fails to capture $\mathcal{FOC}_2$ classifiers in the general case. Nevertheless, on the positive side, we establish that R$^2$-GNNs models are equivalent to $\mathcal{FOC}_2$ classifiers under certain restricted yet reasonable scenarios. To address the limitations of R$^2$-GNNs regarding expressiveness, we propose a simple graph transformation technique, akin to a preprocessing step, which can be executed in linear time. This transformation enables R$^2$-GNNs to effectively capture any $\mathcal{FOC}_2$ classifiers when applied to the "transformed" input graph. Moreover, we extend our analysis of expressiveness and graph transformation to temporal graphs, exploring several temporal GNN architectures and providing an expressiveness hierarchy for them. To validate our findings, we implement R$^2$-GNNs and the graph transformation technique and conduct empirical tests in node classification tasks against various well-known GNN architectures that support multi-relational or temporal graphs. Our experimental results consistently demonstrate that R$^2$-GNN with the graph transformation outperforms the baseline methods on both synthetic and real-world datasets
    Domain Adaptive Graph Neural Networks for Constraining Cosmological Parameters Across Multiple Data Sets. (arXiv:2311.01588v1 [astro-ph.CO])
    Deep learning models have been shown to outperform methods that rely on summary statistics, like the power spectrum, in extracting information from complex cosmological data sets. However, due to differences in the subgrid physics implementation and numerical approximations across different simulation suites, models trained on data from one cosmological simulation show a drop in performance when tested on another. Similarly, models trained on any of the simulations would also likely experience a drop in performance when applied to observational data. Training on data from two different suites of the CAMELS hydrodynamic cosmological simulations, we examine the generalization capabilities of Domain Adaptive Graph Neural Networks (DA-GNNs). By utilizing GNNs, we capitalize on their capacity to capture structured scale-free cosmological information from galaxy distributions. Moreover, by including unsupervised domain adaptation via Maximum Mean Discrepancy (MMD), we enable our models to extract domain-invariant features. We demonstrate that DA-GNN achieves higher accuracy and robustness on cross-dataset tasks (up to $28\%$ better relative error and up to almost an order of magnitude better $\chi^2$). Using data visualizations, we show the effects of domain adaptation on proper latent space data alignment. This shows that DA-GNNs are a promising method for extracting domain-independent cosmological information, a vital step toward robust deep learning for real cosmic survey data.
    Amide Proton Transfer (APT) imaging in tumor with a machine learning approach using partially synthetic data. (arXiv:2311.01683v1 [physics.med-ph])
    Machine learning (ML) has been increasingly used to quantify chemical exchange saturation transfer (CEST) effect. ML models are typically trained using either measured data or fully simulated data. However, training with measured data often lacks sufficient training data, while training with fully simulated data may introduce bias due to limited simulations pools. This study introduces a new platform that combines simulated and measured components to generate partially synthetic CEST data, and to evaluate its feasibility for training ML models to predict amide proton transfer (APT) effect. Partially synthetic CEST signals were created using an inverse summation of APT effects from simulations and the other components from measurements. Training data were generated by varying APT simulation parameters and applying scaling factors to adjust the measured components, achieving a balance between simulation flexibility and fidelity. First, tissue-mimicking CEST signals along with ground truth information were created using multiple-pool model simulations to validate this method. Second, an ML model was trained individually on partially synthetic data, in vivo data, and fully simulated data, to predict APT effect in rat brains bearing 9L tumors. Experiments on tissue-mimicking data suggest that the ML method using the partially synthetic data is accurate in predicting APT. In vivo experiments suggest that our method provides more accurate and robust prediction than the training using in vivo data and fully synthetic data. Partially synthetic CEST data can address the challenges in conventional ML methods.
    SemiGPC: Distribution-Aware Label Refinement for Imbalanced Semi-Supervised Learning Using Gaussian Processes. (arXiv:2311.01646v1 [cs.CV])
    In this paper we introduce SemiGPC, a distribution-aware label refinement strategy based on Gaussian Processes where the predictions of the model are derived from the labels posterior distribution. Differently from other buffer-based semi-supervised methods such as CoMatch and SimMatch, our SemiGPC includes a normalization term that addresses imbalances in the global data distribution while maintaining local sensitivity. This explicit control allows SemiGPC to be more robust to confirmation bias especially under class imbalance. We show that SemiGPC improves performance when paired with different Semi-Supervised methods such as FixMatch, ReMixMatch, SimMatch and FreeMatch and different pre-training strategies including MSN and Dino. We also show that SemiGPC achieves state of the art results under different degrees of class imbalance on standard CIFAR10-LT/CIFAR100-LT especially in the low data-regime. Using SemiGPC also results in about 2% avg.accuracy increase compared to a new competitive baseline on the more challenging benchmarks SemiAves, SemiCUB, SemiFungi and Semi-iNat.
    ATGNN: Audio Tagging Graph Neural Network. (arXiv:2311.01526v1 [cs.SD])
    Deep learning models such as CNNs and Transformers have achieved impressive performance for end-to-end audio tagging. Recent works have shown that despite stacking multiple layers, the receptive field of CNNs remains severely limited. Transformers on the other hand are able to map global context through self-attention, but treat the spectrogram as a sequence of patches which is not flexible enough to capture irregular audio objects. In this work, we treat the spectrogram in a more flexible way by considering it as graph structure and process it with a novel graph neural architecture called ATGNN. ATGNN not only combines the capability of CNNs with the global information sharing ability of Graph Neural Networks, but also maps semantic relationships between learnable class embeddings and corresponding spectrogram regions. We evaluate ATGNN on two audio tagging tasks, where it achieves 0.585 mAP on the FSD50K dataset and 0.335 mAP on the AudioSet-balanced dataset, achieving comparable results to Transformer based models with significantly lower number of learnable parameters.
    Investigating the Behavior of Diffusion Models for Accelerating Electronic Structure Calculations. (arXiv:2311.01491v1 [physics.chem-ph])
    We present an investigation into diffusion models for molecular generation, with the aim of better understanding how their predictions compare to the results of physics-based calculations. The investigation into these models is driven by their potential to significantly accelerate electronic structure calculations using machine learning, without requiring expensive first-principles datasets for training interatomic potentials. We find that the inference process of a popular diffusion model for de novo molecular generation is divided into an exploration phase, where the model chooses the atomic species, and a relaxation phase, where it adjusts the atomic coordinates to find a low-energy geometry. As training proceeds, we show that the model initially learns about the first-order structure of the potential energy surface, and then later learns about higher-order structure. We also find that the relaxation phase of the diffusion model can be re-purposed to sample the Boltzmann distribution over conformations and to carry out structure relaxations. For structure relaxations, the model finds geometries with ~10x lower energy than those produced by a classical force field for small organic molecules. Initializing a density functional theory (DFT) relaxation at the diffusion-produced structures yields a >2x speedup to the DFT relaxation when compared to initializing at structures relaxed with a classical force field.
    An Efficient Detection and Control System for Underwater Docking using Machine Learning and Realistic Simulation: A Comprehensive Approach. (arXiv:2311.01522v1 [cs.RO])
    Underwater docking is critical to enable the persistent operation of Autonomous Underwater Vehicles (AUVs). For this, the AUV must be capable of detecting and localizing the docking station, which is complex due to the highly dynamic undersea environment. Image-based solutions offer a high acquisition rate and versatile alternative to adapt to this environment; however, the underwater environment presents challenges such as low visibility, high turbidity, and distortion. In addition to this, field experiments to validate underwater docking capabilities can be costly and dangerous due to the specialized equipment and safety considerations required to conduct the experiments. This work compares different deep-learning architectures to perform underwater docking detection and classification. The architecture with the best performance is then compressed using knowledge distillation under the teacher-student paradigm to reduce the network's memory footprint, allowing real-time implementation. To reduce the simulation-to-reality gap, a Generative Adversarial Network (GAN) is used to do image-to-image translation, converting the Gazebo simulation image into a realistic underwater-looking image. The obtained image is then processed using an underwater image formation model to simulate image attenuation over distance under different water types. The proposed method is finally evaluated according to the AUV docking success rate and compared with classical vision methods. The simulation results show an improvement of 20% in the high turbidity scenarios regardless of the underwater currents. Furthermore, we show the performance of the proposed approach by showing experimental results on the off-the-shelf AUV Iver3.
    Local Borsuk-Ulam, Stability, and Replicability. (arXiv:2311.01599v1 [cs.LG])
    We use and adapt the Borsuk-Ulam Theorem from topology to derive limitations on list-replicable and globally stable learning algorithms. We further demonstrate the applicability of our methods in combinatorics and topology. We show that, besides trivial cases, both list-replicable and globally stable learning are impossible in the agnostic PAC setting. This is in contrast with the realizable case where it is known that any class with a finite Littlestone dimension can be learned by such algorithms. In the realizable PAC setting, we sharpen previous impossibility results and broaden their scope. Specifically, we establish optimal bounds for list replicability and global stability numbers in finite classes. This provides an exponential improvement over previous works and implies an exponential separation from the Littlestone dimension. We further introduce lower bounds for weak learners, i.e., learners that are only marginally better than random guessing. Lower bounds from previous works apply only to stronger learners. To offer a broader and more comprehensive view of our topological approach, we prove a local variant of the Borsuk-Ulam theorem in topology and a result in combinatorics concerning Kneser colorings. In combinatorics, we prove that if $c$ is a coloring of all non-empty subsets of $[n]$ such that disjoint sets have different colors, then there is a chain of subsets that receives at least $1+ \lfloor n/2\rfloor$ colors (this bound is sharp). In topology, we prove e.g. that for any open antipodal-free cover of the $d$-dimensional sphere, there is a point $x$ that belongs to at least $t=\lceil\frac{d+3}{2}\rceil$ sets.
    High-performance real-world optical computing trained by in situ model-free optimization. (arXiv:2307.11957v3 [physics.optics] UPDATED)
    Optical computing systems can provide high-speed and low-energy data processing but face deficiencies in computationally demanding training and simulation-to-reality gap. We propose a model-free solution for lightweight in situ optimization of optical computing systems based on the score gradient estimation algorithm. This approach treats the system as a black box and back-propagates loss directly to the optical weights' probabilistic distributions, hence circumventing the need for computation-heavy and biased system simulation. We demonstrate a superior classification accuracy on the MNIST and FMNIST datasets through experiments on a single-layer diffractive optical computing system. Furthermore, we show its potential for image-free and high-speed cell analysis. The inherent simplicity of our proposed method, combined with its low demand for computational resources, expedites the transition of optical computing from laboratory demonstrations to real-world applications.  ( 2 min )
    From SMOTE to Mixup for Deep Imbalanced Classification. (arXiv:2308.15457v2 [cs.LG] UPDATED)
    Given imbalanced data, it is hard to train a good classifier using deep learning because of the poor generalization of minority classes. Traditionally, the well-known synthetic minority oversampling technique (SMOTE) for data augmentation, a data mining approach for imbalanced learning, has been used to improve this generalization. However, it is unclear whether SMOTE also benefits deep learning. In this work, we study why the original SMOTE is insufficient for deep learning, and enhance SMOTE using soft labels. Connecting the resulting soft SMOTE with Mixup, a modern data augmentation technique, leads to a unified framework that puts traditional and modern data augmentation techniques under the same umbrella. A careful study within this framework shows that Mixup improves generalization by implicitly achieving uneven margins between majority and minority classes. We then propose a novel margin-aware Mixup technique that more explicitly achieves uneven margins. Extensive experimental results demonstrate that our proposed technique yields state-of-the-art performance on deep imbalanced classification while achieving superior performance on extremely imbalanced data. The code is open-sourced in our developed package https://github.com/ntucllab/imbalanced-DL to foster future research in this direction.
    Landscape Surrogate: Learning Decision Losses for Mathematical Optimization Under Partial Information. (arXiv:2307.08964v2 [cs.LG] UPDATED)
    Recent works in learning-integrated optimization have shown promise in settings where the optimization problem is only partially observed or where general-purpose optimizers perform poorly without expert tuning. By learning an optimizer $\mathbf{g}$ to tackle these challenging problems with $f$ as the objective, the optimization process can be substantially accelerated by leveraging past experience. The optimizer can be trained with supervision from known optimal solutions or implicitly by optimizing the compound function $f\circ \mathbf{g}$. The implicit approach may not require optimal solutions as labels and is capable of handling problem uncertainty; however, it is slow to train and deploy due to frequent calls to optimizer $\mathbf{g}$ during both training and testing. The training is further challenged by sparse gradients of $\mathbf{g}$, especially for combinatorial solvers. To address these challenges, we propose using a smooth and learnable Landscape Surrogate $M$ as a replacement for $f\circ \mathbf{g}$. This surrogate, learnable by neural networks, can be computed faster than the solver $\mathbf{g}$, provides dense and smooth gradients during training, can generalize to unseen optimization problems, and is efficiently learned via alternating optimization. We test our approach on both synthetic problems, including shortest path and multidimensional knapsack, and real-world problems such as portfolio optimization, achieving comparable or superior objective values compared to state-of-the-art baselines while reducing the number of calls to $\mathbf{g}$. Notably, our approach outperforms existing methods for computationally expensive high-dimensional problems.  ( 3 min )
    Adaptive Algorithms for Relaxed Pareto Set Identification. (arXiv:2307.00424v2 [stat.ML] UPDATED)
    In this paper we revisit the fixed-confidence identification of the Pareto optimal set in a multi-objective multi-armed bandit model. As the sample complexity to identify the exact Pareto set can be very large, a relaxation allowing to output some additional near-optimal arms has been studied. In this work we also tackle alternative relaxations that allow instead to identify a relevant subset of the Pareto set. Notably, we propose a single sampling strategy, called Adaptive Pareto Exploration, that can be used in conjunction with different stopping rules to take into account different relaxations of the Pareto Set Identification problem. We analyze the sample complexity of these different combinations, quantifying in particular the reduction in sample complexity that occurs when one seeks to identify at most $k$ Pareto optimal arms. We showcase the good practical performance of Adaptive Pareto Exploration on a real-world scenario, in which we adaptively explore several vaccination strategies against Covid-19 in order to find the optimal ones when multiple immunogenicity criteria are taken into account.  ( 2 min )
    Deconstructing Data Reconstruction: Multiclass, Weight Decay and General Losses. (arXiv:2307.01827v2 [cs.LG] UPDATED)
    Memorization of training data is an active research area, yet our understanding of the inner workings of neural networks is still in its infancy. Recently, Haim et al. (2022) proposed a scheme to reconstruct training samples from multilayer perceptron binary classifiers, effectively demonstrating that a large portion of training samples are encoded in the parameters of such networks. In this work, we extend their findings in several directions, including reconstruction from multiclass and convolutional neural networks. We derive a more general reconstruction scheme which is applicable to a wider range of loss functions such as regression losses. Moreover, we study the various factors that contribute to networks' susceptibility to such reconstruction schemes. Intriguingly, we observe that using weight decay during training increases reconstructability both in terms of quantity and quality. Additionally, we examine the influence of the number of neurons relative to the number of training samples on the reconstructability. Code: https://github.com/gonbuzaglo/decoreco  ( 2 min )
    MARRS: Multimodal Reference Resolution System. (arXiv:2311.01650v1 [cs.CL])
    Successfully handling context is essential for any dialog understanding task. This context maybe be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual and background context. In particular, we present different machine learning models to enable handing contextual queries; specifically, one to enable reference resolution, and one to handle context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy.
    An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient. (arXiv:2307.08873v3 [cs.LG] UPDATED)
    Restricting the variance of a policy's return is a popular choice in risk-averse Reinforcement Learning (RL) due to its clear mathematical definition and easy interpretability. Traditional methods directly restrict the total return variance. Recent methods restrict the per-step reward variance as a proxy. We thoroughly examine the limitations of these variance-based methods, such as sensitivity to numerical scale and hindering of policy learning, and propose to use an alternative risk measure, Gini deviation, as a substitute. We study various properties of this new risk measure and derive a policy gradient algorithm to minimize it. Empirical evaluation in domains where risk-aversion can be clearly defined, shows that our algorithm can mitigate the limitations of variance-based risk measures and achieves high return with low risk in terms of variance and Gini deviation when others fail to learn a reasonable policy.  ( 2 min )
    Improving Interpersonal Communication by Simulating Audiences with Language Models. (arXiv:2311.00687v2 [cs.AI] UPDATED)
    How do we communicate with others to achieve our goals? We use our prior experience or advice from others, or construct a candidate utterance by predicting how it will be received. However, our experiences are limited and biased, and reasoning about potential outcomes can be difficult and cognitively challenging. In this paper, we explore how we can leverage Large Language Model (LLM) simulations to help us communicate better. We propose the Explore-Generate-Simulate (EGS) framework, which takes as input any scenario where an individual is communicating to an audience with a goal they want to achieve. EGS (1) explores the solution space by producing a diverse set of advice relevant to the scenario, (2) generates communication candidates conditioned on subsets of the advice, and (3) simulates the reactions from various audiences to determine both the best candidate and advice to use. We evaluate the framework on eight scenarios spanning the ten fundamental processes of interpersonal communication. For each scenario, we collect a dataset of human evaluations across candidates and baselines, and showcase that our framework's chosen candidate is preferred over popular generation mechanisms including Chain-of-Thought. We also find that audience simulations achieve reasonably high agreement with human raters across 5 of the 8 scenarios. Finally, we demonstrate the generality of our framework by applying it to real-world scenarios described by users on web forums. Through evaluations and demonstrations, we show that EGS enhances the effectiveness and outcomes of goal-oriented communication across a variety of situations, thus opening up new possibilities for the application of large language models in revolutionizing communication and decision-making processes.
    Distill Gold from Massive Ores: Efficient Dataset Distillation via Critical Samples Selection. (arXiv:2305.18381v2 [cs.LG] UPDATED)
    Data-efficient learning has drawn significant attention, especially given the current trend of large multi-modal models, where dataset distillation can be an effective solution. However, the dataset distillation process itself is still very inefficient. In this work, we model the distillation problem with reference to information transport. Observing that severe data redundancy exists in dataset distillation, we argue to put more emphasis on the utility of the training samples. We propose a family of methods to exploit the most valuable samples, which is validated by our comprehensive analysis of the optimal data selection. The new strategy significantly reduces the training cost and extends a variety of existing distillation algorithms to larger and more diversified datasets, e.g., in some cases only 0.04% training data is sufficient for comparable distillation performance. Moreover, our strategy consistently enhances the performance, which may open up new analyses on the dynamics of distillation and networks. Our method is able to extend the distillation algorithms to much larger-scale datasets and more heterogeneous datasets, e.g., ImageNet-1K and Kinetics-400. Our code is available on https://github.com/silicx/GoldFromOres.
    MEDL-U: Uncertainty-aware 3D Automatic Annotation based on Evidential Deep Learning. (arXiv:2309.09599v2 [cs.CV] UPDATED)
    Advancements in deep learning-based 3D object detection necessitate the availability of large-scale datasets. However, this requirement introduces the challenge of manual annotation, which is often both burdensome and time-consuming. To tackle this issue, the literature has seen the emergence of several weakly supervised frameworks for 3D object detection which can automatically generate pseudo labels for unlabeled data. Nevertheless, these generated pseudo labels contain noise and are not as accurate as those labeled by humans. In this paper, we present the first approach that addresses the inherent ambiguities present in pseudo labels by introducing an Evidential Deep Learning (EDL) based uncertainty estimation framework. Specifically, we propose MEDL-U, an EDL framework based on MTrans, which not only generates pseudo labels but also quantifies the associated uncertainties. However, applying EDL to 3D object detection presents three primary challenges: (1) relatively lower pseudolabel quality in comparison to other autolabelers; (2) excessively high evidential uncertainty estimates; and (3) lack of clear interpretability and effective utilization of uncertainties for downstream tasks. We tackle these issues through the introduction of an uncertainty-aware IoU-based loss, an evidence-aware multi-task loss function, and the implementation of a post-processing stage for uncertainty refinement. Our experimental results demonstrate that probabilistic detectors trained using the outputs of MEDL-U surpass deterministic detectors trained using outputs from previous 3D annotators on the KITTI val set for all difficulty levels. Moreover, MEDL-U achieves state-of-the-art results on the KITTI official test set compared to existing 3D automatic annotators.
    Fractional Denoising for 3D Molecular Pre-training. (arXiv:2307.10683v2 [q-bio.QM] UPDATED)
    Coordinate denoising is a promising 3D molecular pre-training method, which has achieved remarkable performance in various downstream drug discovery tasks. Theoretically, the objective is equivalent to learning the force field, which is revealed helpful for downstream tasks. Nevertheless, there are two challenges for coordinate denoising to learn an effective force field, i.e. low coverage samples and isotropic force field. The underlying reason is that molecular distributions assumed by existing denoising methods fail to capture the anisotropic characteristic of molecules. To tackle these challenges, we propose a novel hybrid noise strategy, including noises on both dihedral angel and coordinate. However, denoising such hybrid noise in a traditional way is no more equivalent to learning the force field. Through theoretical deductions, we find that the problem is caused by the dependency of the input conformation for covariance. To this end, we propose to decouple the two types of noise and design a novel fractional denoising method (Frad), which only denoises the latter coordinate part. In this way, Frad enjoys both the merits of sampling more low-energy structures and the force field equivalence. Extensive experiments show the effectiveness of Frad in molecular representation, with a new state-of-the-art on 9 out of 12 tasks of QM9 and on 7 out of 8 targets of MD17.  ( 2 min )
    Universal Domain Adaptation from Foundation Models: A Baseline Study. (arXiv:2305.11092v2 [cs.LG] UPDATED)
    Foundation models (e.g., CLIP or DINOv2) have shown their impressive learning and transfer capabilities in a wide range of visual tasks, by training on a large corpus of data and adapting to specific downstream tasks. It is, however, interesting that foundation models have not been fully explored for universal domain adaptation (UniDA), which is to learn models using labeled data in a source domain and unlabeled data in a target one, such that the learned models can successfully adapt to the target data. In this paper, we make comprehensive empirical studies of state-of-the-art UniDA methods using foundation models. We first observe that, unlike fine-tuning from ImageNet pre-trained models, as previous methods do, fine-tuning from foundation models yields significantly poorer results, sometimes even worse than training from scratch. While freezing the backbones, we demonstrate that although the foundation models greatly improve the performance of the baseline method that trains the models on the source data alone, existing UniDA methods generally fail to improve over the baseline. This suggests that new research efforts are very necessary for UniDA using foundation models. Based on these findings, we introduce \textit{CLIP distillation}, a parameter-free method specifically designed to distill target knowledge from CLIP models. The core of our \textit{CLIP distillation} lies in a self-calibration technique for automatic temperature scaling, a feature that significantly enhances the baseline's out-class detection capability. Although simple, our method outperforms previous approaches in most benchmark tasks, excelling in evaluation metrics including H-score/H$^3$-score and the newly proposed universal classification rate (UCR) metric. We hope that our investigation and the proposed simple framework can serve as a strong baseline to facilitate future studies in this field.
    Adversarial Attacks against Binary Similarity Systems. (arXiv:2303.11143v2 [cs.CR] UPDATED)
    In recent years, binary analysis gained traction as a fundamental approach to inspect software and guarantee its security. Due to the exponential increase of devices running software, much research is now moving towards new autonomous solutions based on deep learning models, as they have been showing state-of-the-art performances in solving binary analysis problems. One of the hot topics in this context is binary similarity, which consists in determining if two functions in assembly code are compiled from the same source code. However, it is unclear how deep learning models for binary similarity behave in an adversarial context. In this paper, we study the resilience of binary similarity models against adversarial examples, showing that they are susceptible to both targeted and untargeted attacks (w.r.t. similarity goals) performed by black-box and white-box attackers. In more detail, we extensively test three current state-of-the-art solutions for binary similarity against two black-box greedy attacks, including a new technique that we call Spatial Greedy, and one white-box attack in which we repurpose a gradient-guided strategy used in attacks to image classifiers.
    Improving Lesion Segmentation in FDG-18 Whole-Body PET/CT scans using Multilabel approach: AutoPET II challenge. (arXiv:2311.01574v1 [eess.IV])
    Automatic segmentation of lesions in FDG-18 Whole Body (WB) PET/CT scans using deep learning models is instrumental for determining treatment response, optimizing dosimetry, and advancing theranostic applications in oncology. However, the presence of organs with elevated radiotracer uptake, such as the liver, spleen, brain, and bladder, often leads to challenges, as these regions are often misidentified as lesions by deep learning models. To address this issue, we propose a novel approach of segmenting both organs and lesions, aiming to enhance the performance of automatic lesion segmentation methods. In this study, we assessed the effectiveness of our proposed method using the AutoPET II challenge dataset, which comprises 1014 subjects. We evaluated the impact of inclusion of additional labels and data in the segmentation performance of the model. In addition to the expert-annotated lesion labels, we introduced eight additional labels for organs, including the liver, kidneys, urinary bladder, spleen, lung, brain, heart, and stomach. These labels were integrated into the dataset, and a 3D UNET model was trained within the nnUNet framework. Our results demonstrate that our method achieved the top ranking in the held-out test dataset, underscoring the potential of this approach to significantly improve lesion segmentation accuracy in FDG-18 Whole-Body PET/CT scans, ultimately benefiting cancer patients and advancing clinical practice.
    Hardness of Low Rank Approximation of Entrywise Transformed Matrix Products. (arXiv:2311.01960v1 [cs.DS])
    Inspired by fast algorithms in natural language processing, we study low rank approximation in the entrywise transformed setting where we want to find a good rank $k$ approximation to $f(U \cdot V)$, where $U, V^\top \in \mathbb{R}^{n \times r}$ are given, $r = O(\log(n))$, and $f(x)$ is a general scalar function. Previous work in sublinear low rank approximation has shown that if both (1) $U = V^\top$ and (2) $f(x)$ is a PSD kernel function, then there is an $O(nk^{\omega-1})$ time constant relative error approximation algorithm, where $\omega \approx 2.376$ is the exponent of matrix multiplication. We give the first conditional time hardness results for this problem, demonstrating that both conditions (1) and (2) are in fact necessary for getting better than $n^{2-o(1)}$ time for a relative error low rank approximation for a wide class of functions. We give novel reductions from the Strong Exponential Time Hypothesis (SETH) that rely on lower bounding the leverage scores of flat sparse vectors and hold even when the rank of the transformed matrix $f(UV)$ and the target rank are $n^{o(1)}$, and when $U = V^\top$. Furthermore, even when $f(x) = x^p$ is a simple polynomial, we give runtime lower bounds in the case when $U \neq V^\top$ of the form $\Omega(\min(n^{2-o(1)}, \Omega(2^p)))$. Lastly, we demonstrate that our lower bounds are tight by giving an $O(n \cdot \text{poly}(k, 2^p, 1/\epsilon))$ time relative error approximation algorithm and a fast $O(n \cdot \text{poly}(k, p, 1/\epsilon))$ additive error approximation using fast tensor-based sketching. Additionally, since our low rank algorithms rely on matrix-vector product subroutines, our lower bounds extend to show that computing $f(UV)W$, for even a small matrix $W$, requires $\Omega(n^{2-o(1)})$ time.
    Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos. (arXiv:2311.02076v1 [cs.LG])
    In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training. This includes early time regimes where the sharpness may decrease during early periods of training (sharpness reduction), and later time behavior such as progressive sharpening and edge of stability. We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios. By analyzing the structure of dynamical fixed points in function space and the vector field of function updates, we uncover the underlying mechanisms behind these sharpness trends. Our analysis reveals (i) the mechanism behind early sharpness reduction and progressive sharpening, (ii) the required conditions for edge of stability, and (iii) a period-doubling route to chaos on the edge of stability manifold as learning rate is increased. Finally, we demonstrate that various predictions from this simplified model generalize to real-world scenarios and discuss its limitations.
    GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling. (arXiv:2311.01927v1 [cs.LG])
    Linear Recurrence has proven to be a powerful tool for modeling long sequences efficiently. In this work, we show that existing models fail to take full advantage of its potential. Motivated by this finding, we develop GateLoop, a foundational sequence model that generalizes linear recurrent models such as S4, S5, LRU and RetNet, by employing data-controlled state transitions. Utilizing this theoretical advance, GateLoop empirically outperforms existing models for auto-regressive language modeling. Our method comes with a low-cost $O(l)$ recurrent mode and an efficient $O(l \log_{2} l)$ parallel mode making use of highly optimized associative scan implementations. Furthermore, we derive an $O(l^2)$ surrogate attention mode, revealing remarkable implications for Transformer and recently proposed architectures. Specifically, we prove that our approach can be interpreted as providing data-controlled relative-positional information to Attention. While many existing models solely rely on data-controlled cumulative sums for context aggregation, our findings suggest that incorporating data-controlled complex cumulative products may be a crucial step towards more powerful sequence models.
    To Stay or Not to Stay in the Pre-train Basin: Insights on Ensembling in Transfer Learning. (arXiv:2303.03374v2 [cs.LG] UPDATED)
    Transfer learning and ensembling are two popular techniques for improving the performance and robustness of neural networks. Due to the high cost of pre-training, ensembles of models fine-tuned from a single pre-trained checkpoint are often used in practice. Such models end up in the same basin of the loss landscape, which we call the pre-train basin, and thus have limited diversity. In this work, we show that ensembles trained from a single pre-trained checkpoint may be improved by better exploring the pre-train basin, however, leaving the basin results in losing the benefits of transfer learning and in degradation of the ensemble quality. Based on the analysis of existing exploration methods, we propose a more effective modification of the Snapshot Ensembles (SSE) for transfer learning setup, StarSSE, which results in stronger ensembles and uniform model soups.
    Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel. (arXiv:2311.01762v1 [stat.ML])
    Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. The solution can be obtained either as a closed-form solution, which includes a matrix inversion, or iteratively through gradient descent. Using the iterative approach opens up for changing the kernel during training, something that is investigated in this paper. We theoretically address the effects this has on model complexity and generalization. Based on our findings, we propose an update scheme for the bandwidth of translational-invariant kernels, where we let the bandwidth decrease to zero during training, thus circumventing the need for hyper-parameter selection. We demonstrate on real and synthetic data how decreasing the bandwidth during training outperforms using a constant bandwidth, selected by cross-validation and marginal likelihood maximization. We also show theoretically and empirically that using a decreasing bandwidth, we are able to achieve both zero training error in combination with good generalization, and a double descent behavior, phenomena that do not occur for KRR with constant bandwidth but are known to appear for neural networks.
    On the Convergence of Encoder-only Shallow Transformers. (arXiv:2311.01575v1 [cs.LG])
    In this paper, we aim to build the global convergence theory of encoder-only shallow Transformers under a realistic setting from the perspective of architectures, initialization, and scaling under a finite width regime. The difficulty lies in how to tackle the softmax in self-attention mechanism, the core ingredient of Transformer. In particular, we diagnose the scaling scheme, carefully tackle the input/output of softmax, and prove that quadratic overparameterization is sufficient for global convergence of our shallow Transformers under commonly-used He/LeCun initialization in practice. Besides, neural tangent kernel (NTK) based analysis is also given, which facilitates a comprehensive comparison. Our theory demonstrates the separation on the importance of different scaling schemes and initialization. We believe our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
    Learning to Augment Distributions for Out-of-Distribution Detection. (arXiv:2311.01796v1 [cs.LG])
    Open-world classification systems should discern out-of-distribution (OOD) data whose labels deviate from those of in-distribution (ID) cases, motivating recent studies in OOD detection. Advanced works, despite their promising progress, may still fail in the open world, owing to the lack of knowledge about unseen OOD data in advance. Although one can access auxiliary OOD data (distinct from unseen ones) for model training, it remains to analyze how such auxiliary data will work in the open world. To this end, we delve into such a problem from a learning theory perspective, finding that the distribution discrepancy between the auxiliary and the unseen real OOD data is the key to affecting the open-world detection performance. Accordingly, we propose Distributional-Augmented OOD Learning (DAL), alleviating the OOD distribution discrepancy by crafting an OOD distribution set that contains all distributions in a Wasserstein ball centered on the auxiliary OOD distribution. We justify that the predictor trained over the worst OOD data in the ball can shrink the OOD distribution discrepancy, thus improving the open-world detection performance given only the auxiliary OOD data. We conduct extensive evaluations across representative OOD detection setups, demonstrating the superiority of our DAL over its advanced counterparts.
    Better Fair than Sorry: Adversarial Missing Data Imputation for Fair GNNs. (arXiv:2311.01591v1 [cs.LG])
    This paper addresses the problem of learning fair Graph Neural Networks (GNNs) under missing protected attributes. GNNs have achieved state-of-the-art results in many relevant tasks where decisions might disproportionately impact specific communities. However, existing work on fair GNNs assumes that either protected attributes are fully-observed or that the missing data imputation is fair. In practice, biases in the imputation will be propagated to the model outcomes, leading them to overestimate the fairness of their predictions. We address this challenge by proposing Better Fair than Sorry (BFtS), a fair missing data imputation model for protected attributes used by fair GNNs. The key design principle behind BFtS is that imputations should approximate the worst-case scenario for the fair GNN -- i.e. when optimizing fairness is the hardest. We implement this idea using a 3-player adversarial scheme where two adversaries collaborate against the fair GNN. Experiments using synthetic and real datasets show that BFtS often achieves a better fairness $\times$ accuracy trade-off than existing alternatives.
    Phase transitions in nonparametric regressions. (arXiv:2112.03626v7 [math.ST] UPDATED)
    When the unknown regression function of a single variable is known to have derivatives up to the $(\gamma+1)$th order bounded in absolute values by a common constant everywhere or a.e. (i.e., $(\gamma+1)$th degree of smoothness), the minimax optimal rate of the mean integrated squared error (MISE) is stated as $\left(\frac{1}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$ in the literature. This paper shows that: (i) if $n\leq\left(\gamma+1\right)^{2\gamma+3}$, the minimax optimal MISE rate is $\frac{\log n}{n\log(\log n)}$ and the optimal degree of smoothness to exploit is roughly $\max\left\{ \left\lfloor \frac{\log n}{2\log\left(\log n\right)}\right\rfloor ,\,1\right\} $; (ii) if $n>\left(\gamma+1\right)^{2\gamma+3}$, the minimax optimal MISE rate is $\left(\frac{1}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$ and the optimal degree of smoothness to exploit is $\gamma+1$. The fundamental contribution of this paper is a set of metric entropy bounds we develop for smooth function classes. Some of our bounds are original, and some of them improve and/or generalize the ones in the literature (e.g., Kolmogorov and Tikhomirov, 1959). Our metric entropy bounds allow us to show phase transitions in the minimax optimal MISE rates associated with some commonly seen smoothness classes as well as non-standard smoothness classes, and can also be of independent interest outside the nonparametric regression problems.
    A Statistical Guarantee for Representation Transfer in Multitask Imitation Learning. (arXiv:2311.01589v1 [cs.LG])
    Transferring representation for multitask imitation learning has the potential to provide improved sample efficiency on learning new tasks, when compared to learning from scratch. In this work, we provide a statistical guarantee indicating that we can indeed achieve improved sample efficiency on the target task when a representation is trained using sufficiently diverse source tasks. Our theoretical results can be readily extended to account for commonly used neural network architectures with realistic assumptions. We conduct empirical analyses that align with our theoretical findings on four simulated environments$\unicode{x2014}$in particular leveraging more data from source tasks can improve sample efficiency on learning in the new task.
    Efficient Generalized Low-Rank Tensor Contextual Bandits. (arXiv:2311.01771v1 [cs.LG])
    In this paper, we aim to build a novel bandits algorithm that is capable of fully harnessing the power of multi-dimensional data and the inherent non-linearity of reward functions to provide high-usable and accountable decision-making services. To this end, we introduce a generalized low-rank tensor contextual bandits model in which an action is formed from three feature vectors, and thus can be represented by a tensor. In this formulation, the reward is determined through a generalized linear function applied to the inner product of the action's feature tensor and a fixed but unknown parameter tensor with a low tubal rank. To effectively achieve the trade-off between exploration and exploitation, we introduce a novel algorithm called "Generalized Low-Rank Tensor Exploration Subspace then Refine" (G-LowTESTR). This algorithm first collects raw data to explore the intrinsic low-rank tensor subspace information embedded in the decision-making scenario, and then converts the original problem into an almost lower-dimensional generalized linear contextual bandits problem. Rigorous theoretical analysis shows that the regret bound of G-LowTESTR is superior to those in vectorization and matricization cases. We conduct a series of simulations and real data experiments to further highlight the effectiveness of G-LowTESTR, leveraging its ability to capitalize on the low-rank tensor structure for enhanced learning.
    Adversary ML Resilience in Autonomous Driving Through Human Centered Perception Mechanisms. (arXiv:2311.01478v1 [cs.CV])
    Physical adversarial attacks on road signs are continuously exploiting vulnerabilities in modern day autonomous vehicles (AVs) and impeding their ability to correctly classify what type of road sign they encounter. Current models cannot generalize input data well, resulting in overfitting or underfitting. In overfitting, the model memorizes the input data but cannot generalize to new scenarios. In underfitting, the model does not learn enough of the input data to accurately classify these road signs. This paper explores the resilience of autonomous driving systems against three main physical adversarial attacks (tape, graffiti, illumination), specifically targeting object classifiers. Several machine learning models were developed and evaluated on two distinct datasets: road signs (stop signs, speed limit signs, traffic lights, and pedestrian crosswalk signs) and geometric shapes (octagons, circles, squares, and triangles). The study compared algorithm performance under different conditions, including clean and adversarial training and testing on these datasets. To build robustness against attacks, defense techniques like adversarial training and transfer learning were implemented. Results demonstrated transfer learning models played a crucial role in performance by allowing knowledge gained from shape training to improve generalizability of road sign classification, despite the datasets being completely different. The paper suggests future research directions, including human-in-the-loop validation, security analysis, real-world testing, and explainable AI for transparency. This study aims to contribute to improving security and robustness of object classifiers in autonomous vehicles and mitigating adversarial example impacts on driving systems.
    Learning Sparse Codes with Entropy-Based ELBOs. (arXiv:2311.01888v1 [stat.ML])
    Standard probabilistic sparse coding assumes a Laplace prior, a linear mapping from latents to observables, and Gaussian observable distributions. We here derive a solely entropy-based learning objective for the parameters of standard sparse coding. The novel variational objective has the following features: (A) unlike MAP approximations, it uses non-trivial posterior approximations for probabilistic inference; (B) unlike for previous non-trivial approximations, the novel objective is fully analytical; and (C) the objective allows for a novel principled form of annealing. The objective is derived by first showing that the standard ELBO objective converges to a sum of entropies, which matches similar recent results for generative models with Gaussian priors. The conditions under which the ELBO becomes equal to entropies are then shown to have analytical solutions, which leads to the fully analytical objective. Numerical experiments are used to demonstrate the feasibility of learning with such entropy-based ELBOs. We investigate different posterior approximations including Gaussians with correlated latents and deep amortized approximations. Furthermore, we numerically investigate entropy-based annealing which results in improved learning. Our main contributions are theoretical, however, and they are twofold: (1) for non-trivial posterior approximations, we provide the (to the knowledge of the authors) first analytical ELBO objective for standard probabilistic sparse coding; and (2) we provide the first demonstration on how a recently shown convergence of the ELBO to entropy sums can be used for learning.
    Optimistic Multi-Agent Policy Gradient for Cooperative Tasks. (arXiv:2311.01953v1 [cs.LG])
    \textit{Relative overgeneralization} (RO) occurs in cooperative multi-agent learning tasks when agents converge towards a suboptimal joint policy due to overfitting to suboptimal behavior of other agents. In early work, optimism has been shown to mitigate the \textit{RO} problem when using tabular Q-learning. However, with function approximation optimism can amplify overestimation and thus fail on complex tasks. On the other hand, recent deep multi-agent policy gradient (MAPG) methods have succeeded in many complex tasks but may fail with severe \textit{RO}. We propose a general, yet simple, framework to enable optimistic updates in MAPG methods and alleviate the RO problem. Specifically, we employ a \textit{Leaky ReLU} function where a single hyperparameter selects the degree of optimism to reshape the advantages when updating the policy. Intuitively, our method remains optimistic toward individual actions with lower returns which are potentially caused by other agents' sub-optimal behavior during learning. The optimism prevents the individual agents from quickly converging to a local optimum. We also provide a formal analysis from an operator view to understand the proposed advantage transformation. In extensive evaluations on diverse sets of tasks, including illustrative matrix games, complex \textit{Multi-agent MuJoCo} and \textit{Overcooked} benchmarks, the proposed method\footnote{Code can be found at \url{https://github.com/wenshuaizhao/optimappo}.} outperforms strong baselines on 13 out of 19 tested tasks and matches the performance on the rest.
    VQPy: An Object-Oriented Approach to Modern Video Analytics. (arXiv:2311.01623v1 [cs.CV])
    Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., human, animals, cars, etc.), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend$\unicode{x2015}$a Python variant with constructs that make it easy for users to express video objects and their interactions$\unicode{x2015}$as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.
    Maximum Likelihood Estimation of Flexible Survival Densities with Importance Sampling. (arXiv:2311.01660v1 [cs.LG])
    Survival analysis is a widely-used technique for analyzing time-to-event data in the presence of censoring. In recent years, numerous survival analysis methods have emerged which scale to large datasets and relax traditional assumptions such as proportional hazards. These models, while being performant, are very sensitive to model hyperparameters including: (1) number of bins and bin size for discrete models and (2) number of cluster assignments for mixture-based models. Each of these choices requires extensive tuning by practitioners to achieve optimal performance. In addition, we demonstrate in empirical studies that: (1) optimal bin size may drastically differ based on the metric of interest (e.g., concordance vs brier score), and (2) mixture models may suffer from mode collapse and numerical instability. We propose a survival analysis approach which eliminates the need to tune hyperparameters such as mixture assignments and bin sizes, reducing the burden on practitioners. We show that the proposed approach matches or outperforms baselines on several real-world datasets.
    Should Under-parameterized Student Networks Copy or Average Teacher Weights?. (arXiv:2311.01644v1 [cs.LG])
    Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n< k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear, whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when $n-1$ student neurons each copy one teacher neuron and the $n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the student network with $n=1$ neuron, we provide additionally a closed-form solution of the non-trivial critical point(s) for commonly used activation functions through solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of underparameterized networks has a universal structure.
    Sequential Subset Matching for Dataset Distillation. (arXiv:2311.01570v1 [cs.CV])
    Dataset distillation is a newly emerging task that synthesizes a small-size dataset used in training deep neural networks (DNNs) for reducing data storage and model training costs. The synthetic datasets are expected to capture the essence of the knowledge contained in real-world datasets such that the former yields a similar performance as the latter. Recent advancements in distillation methods have produced notable improvements in generating synthetic datasets. However, current state-of-the-art methods treat the entire synthetic dataset as a unified entity and optimize each synthetic instance equally. This static optimization approach may lead to performance degradation in dataset distillation. Specifically, we argue that static optimization can give rise to a coupling issue within the synthetic data, particularly when a larger amount of synthetic data is being optimized. This coupling issue, in turn, leads to the failure of the distilled dataset to extract the high-level features learned by the deep neural network (DNN) in the latter epochs. In this study, we propose a new dataset distillation strategy called Sequential Subset Matching (SeqMatch), which tackles this problem by adaptively optimizing the synthetic data to encourage sequential acquisition of knowledge during dataset distillation. Our analysis indicates that SeqMatch effectively addresses the coupling issue by sequentially generating the synthetic instances, thereby enhancing its performance significantly. Our proposed SeqMatch outperforms state-of-the-art methods in various datasets, including SVNH, CIFAR-10, CIFAR-100, and Tiny ImageNet. Our code is available at https://github.com/shqii1j/seqmatch.
    Robust Adversarial Reinforcement Learning via Bounded Rationality Curricula. (arXiv:2311.01642v1 [cs.LG])
    Robustness against adversarial attacks and distribution shifts is a long-standing goal of Reinforcement Learning (RL). To this end, Robust Adversarial Reinforcement Learning (RARL) trains a protagonist against destabilizing forces exercised by an adversary in a competitive zero-sum Markov game, whose optimal solution, i.e., rational strategy, corresponds to a Nash equilibrium. However, finding Nash equilibria requires facing complex saddle point optimization problems, which can be prohibitive to solve, especially for high-dimensional control. In this paper, we propose a novel approach for adversarial RL based on entropy regularization to ease the complexity of the saddle point optimization problem. We show that the solution of this entropy-regularized problem corresponds to a Quantal Response Equilibrium (QRE), a generalization of Nash equilibria that accounts for bounded rationality, i.e., agents sometimes play random actions instead of optimal ones. Crucially, the connection between the entropy-regularized objective and QRE enables free modulation of the rationality of the agents by simply tuning the temperature coefficient. We leverage this insight to propose our novel algorithm, Quantal Adversarial RL (QARL), which gradually increases the rationality of the adversary in a curriculum fashion until it is fully rational, easing the complexity of the optimization problem while retaining robustness. We provide extensive evidence of QARL outperforming RARL and recent baselines across several MuJoCo locomotion and navigation problems in overall performance and robustness.
    Variable Selection in Maximum Mean Discrepancy for Interpretable Distribution Comparison. (arXiv:2311.01537v1 [stat.ML])
    Two-sample testing decides whether two datasets are generated from the same distribution. This paper studies variable selection for two-sample testing, the task being to identify the variables (or dimensions) responsible for the discrepancies between the two distributions. This task is relevant to many problems of pattern analysis and machine learning, such as dataset shift adaptation, causal inference and model validation. Our approach is based on a two-sample test based on the Maximum Mean Discrepancy (MMD). We optimise the Automatic Relevance Detection (ARD) weights defined for individual variables to maximise the power of the MMD-based test. For this optimisation, we introduce sparse regularisation and propose two methods for dealing with the issue of selecting an appropriate regularisation parameter. One method determines the regularisation parameter in a data-driven way, and the other aggregates the results of different regularisation parameters. We confirm the validity of the proposed methods by systematic comparisons with baseline methods, and demonstrate their usefulness in exploratory analysis of high-dimensional traffic simulation data. Preliminary theoretical analyses are also provided, including a rigorous definition of variable selection for two-sample testing.
    Score Models for Offline Goal-Conditioned Reinforcement Learning. (arXiv:2311.02013v1 [cs.LG])
    Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions. Offline GCRL is pivotal for developing generalist agents capable of leveraging pre-existing datasets to learn diverse and reusable skills without hand-engineering reward functions. However, contemporary approaches to GCRL based on supervised learning and contrastive learning are often suboptimal in the offline setting. An alternative perspective on GCRL optimizes for occupancy matching, but necessitates learning a discriminator, which subsequently serves as a pseudo-reward for downstream RL. Inaccuracies in the learned discriminator can cascade, negatively influencing the resulting policy. We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe. The key insight is combining the occupancy matching perspective of GCRL with a convex dual formulation to derive a learning objective that can better leverage suboptimal offline data. SMORe learns scores or unnormalized densities representing the importance of taking an action at a state for reaching a particular goal. SMORe is principled and our extensive experiments on the fully offline GCRL benchmark composed of robot manipulation and locomotion tasks, including high-dimensional observations, show that SMORe can outperform state-of-the-art baselines by a significant margin.
    Deep Learning for blind spectral unmixing of LULC classes with MODIS multispectral time series and ancillary data. (arXiv:2310.07223v2 [cs.CV] UPDATED)
    Remotely sensed data are dominated by mixed Land Use and Land Cover (LULC) types. Spectral unmixing is a technique to extract information from mixed pixels into their constituent LULC types and corresponding abundance fractions. Traditionally, solving this task has relied on either classical methods that require prior knowledge of endmembers or machine learning methods that avoid explicit endmembers calculation, also known as blind spectral unmixing (BSU). Most BSU studies based on Deep Learning (DL) focus on one time-step hyperspectral or multispectral data. To our knowledge, here we provide the first study on BSU of LULC classes using MODIS multispectral time series, in presence of missing data, with end-to-end DL models. We further boost the performance of a Long-Short Term Memory (LSTM)-based model by incorporating geographic plus topographic (geo-topographic) and climatic ancillary information. Our experiments show that combining spectral-temporal input data together with geo-topographic and climatic information substantially improves the abundance estimation of LULC classes in mixed pixels. To carry out this study, we built a new labeled dataset of the region of Andalusia (Spain) with monthly multispectral time series of pixels for the year 2013 from MODIS at 460m resolution, for two hierarchical levels of LULC classes, named Andalusia MultiSpectral MultiTemporal Unmixing (Andalusia-MSMTU). This dataset provides, at the pixel level, a multispectral time series plus ancillary information annotated with the abundance of each LULC class inside each pixel. The dataset (https://zenodo.org/record/7752348##.ZBmkkezMLdo) and code (https://github.com/jrodriguezortega/MSMTU) are available to the public.  ( 3 min )
    Cost-aware Generalized $\alpha$-investing for Multiple Hypothesis Testing. (arXiv:2210.17514v3 [cs.LG] UPDATED)
    We consider the problem of sequential multiple hypothesis testing with nontrivial data collection costs. This problem appears, for example, when conducting biological experiments to identify differentially expressed genes of a disease process. This work builds on the generalized $\alpha$-investing framework which enables control of the false discovery rate in a sequential testing setting. We make a theoretical analysis of the long term asymptotic behavior of $\alpha$-wealth which motivates a consideration of sample size in the $\alpha$-investing decision rule. Posing the testing process as a game with nature, we construct a decision rule that optimizes the expected $\alpha$-wealth reward (ERO) and provides an optimal sample size for each test. Empirical results show that a cost-aware ERO decision rule correctly rejects more false null hypotheses than other methods for $n=1$ where $n$ is the sample size. When the sample size is not fixed cost-aware ERO uses a prior on the null hypothesis to adaptively allocate of the sample budget to each test. We extend cost-aware ERO investing to finite-horizon testing which enables the decision rule to allocate samples in a non-myopic manner. Finally, empirical tests on real data sets from biological experiments show that cost-aware ERO balances the allocation of samples to an individual test against the allocation of samples across multiple tests.
    Sketching for Convex and Nonconvex Regularized Least Squares with Sharp Guarantees. (arXiv:2311.01806v1 [math.OC])
    Randomized algorithms are important for solving large-scale optimization problems. In this paper, we propose a fast sketching algorithm for least square problems regularized by convex or nonconvex regularization functions, Sketching for Regularized Optimization (SRO). Our SRO algorithm first generates a sketch of the original data matrix, then solves the sketched problem. Different from existing randomized algorithms, our algorithm handles general Frechet subdifferentiable regularization functions in an unified framework. We present general theoretical result for the approximation error between the optimization results of the original problem and the sketched problem for regularized least square problems which can be convex or nonconvex. For arbitrary convex regularizer, relative-error bound is proved for the approximation error. Importantly, minimax rates for sparse signal estimation by solving the sketched sparse convex or nonconvex learning problems are also obtained using our general theoretical result under mild conditions. To the best of our knowledge, our results are among the first to demonstrate minimax rates for convex or nonconvex sparse learning problem by sketching under a unified theoretical framework. We further propose an iterative sketching algorithm which reduces the approximation error exponentially by iteratively invoking the sketching algorithm. Experimental results demonstrate the effectiveness of the proposed SRO and Iterative SRO algorithms.
    On the Generalization Properties of Diffusion Models. (arXiv:2311.01797v1 [cs.LG])
    Diffusion models are a class of generative models that serve to establish a stochastic transport map between an empirically observed, yet unknown, target distribution and a known prior. Despite their remarkable success in real-world applications, a theoretical understanding of their generalization capabilities remains underdeveloped. This work embarks on a comprehensive theoretical exploration of the generalization attributes of diffusion models. We establish theoretical estimates of the generalization gap that evolves in tandem with the training dynamics of score-based diffusion models, suggesting a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$ and the model capacity $m$, evading the curse of dimensionality (i.e., not exponentially large in the data dimension) when early-stopped. Furthermore, we extend our quantitative analysis to a data-dependent scenario, wherein target distributions are portrayed as a succession of densities with progressively increasing distances between modes. This precisely elucidates the adverse effect of "modes shift" in ground truths on the model generalization. Moreover, these estimates are not solely theoretical constructs but have also been confirmed through numerical simulations. Our findings contribute to the rigorous understanding of diffusion models' generalization properties and provide insights that may guide practical applications.
    Graph Neural Diffusion Networks for Semi-supervised Learning. (arXiv:2201.09698v2 [cs.LG] UPDATED)
    Graph Convolutional Networks (GCN) is a pioneering model for graph-based semi-supervised learning. However, GCN does not perform well on sparsely-labeled graphs. Its two-layer version cannot effectively propagate the label information to the whole graph structure (i.e., the under-smoothing problem) while its deep version over-smoothens and is hard to train (i.e., the over-smoothing problem). To solve these two issues, we propose a new graph neural network called GND-Nets (for Graph Neural Diffusion Networks) that exploits the local and global neighborhood information of a vertex in a single layer. Exploiting the shallow network mitigates the over-smoothing problem while exploiting the local and global neighborhood information mitigates the under-smoothing problem. The utilization of the local and global neighborhood information of a vertex is achieved by a new graph diffusion method called neural diffusions, which integrate neural networks into the conventional linear and nonlinear graph diffusions. The adoption of neural networks makes neural diffusions adaptable to different datasets. Extensive experiments on various sparsely-labeled graphs verify the effectiveness and efficiency of GND-Nets compared to state-of-the-art approaches.  ( 2 min )
    Object-Centric Slot Diffusion. (arXiv:2303.10834v5 [cs.CV] UPDATED)
    The recent success of transformer-based image generative models in object-centric learning highlights the importance of powerful image generators for handling complex scenes. However, despite the high expressiveness of diffusion models in image generation, their integration into object-centric learning remains largely unexplored in this domain. In this paper, we explore the feasibility and potential of integrating diffusion models into object-centric learning and investigate the pros and cons of this approach. We introduce Latent Slot Diffusion (LSD), a novel model that serves dual purposes: it is the first object-centric learning model to replace conventional slot decoders with a latent diffusion model conditioned on object slots, and it is also the first unsupervised compositional conditional diffusion model that operates without the need for supervised annotations like text. Through experiments on various object-centric tasks, including the first application of the FFHQ dataset in this field, we demonstrate that LSD significantly outperforms state-of-the-art transformer-based decoders, particularly in more complex scenes, and exhibits superior unsupervised compositional generation quality. In addition, we conduct a preliminary investigation into the integration of pre-trained diffusion models in LSD and demonstrate its effectiveness in real-world image segmentation and generation. Project page is available at https://latentslotdiffusion.github.io  ( 2 min )
    GRANDE: Gradient-Based Decision Tree Ensembles. (arXiv:2309.17130v2 [cs.LG] UPDATED)
    Despite the success of deep learning for text and image data, tree-based ensemble models are still state-of-the-art for machine learning with heterogeneous tabular data. However, there is a significant need for tabular-specific gradient-based methods due to their high flexibility. In this paper, we propose $\text{GRANDE}$, $\text{GRA}$die$\text{N}$t-Based $\text{D}$ecision Tree $\text{E}$nsembles, a novel approach for learning hard, axis-aligned decision tree ensembles using end-to-end gradient descent. GRANDE is based on a dense representation of tree ensembles, which affords to use backpropagation with a straight-through operator to jointly optimize all model parameters. Our method combines axis-aligned splits, which is a useful inductive bias for tabular data, with the flexibility of gradient-based optimization. Furthermore, we introduce an advanced instance-wise weighting that facilitates learning representations for both, simple and complex relations, within a single model. We conducted an extensive evaluation on a predefined benchmark with 19 classification datasets and demonstrate that our method outperforms existing gradient-boosting and deep learning frameworks on most datasets. The method is available under: https://github.com/s-marton/GRANDE  ( 2 min )
    Modality Cycles with Masked Conditional Diffusion for Unsupervised Anomaly Segmentation in MRI. (arXiv:2308.16150v3 [eess.IV] UPDATED)
    Unsupervised anomaly segmentation aims to detect patterns that are distinct from any patterns processed during training, commonly called abnormal or out-of-distribution patterns, without providing any associated manual segmentations. Since anomalies during deployment can lead to model failure, detecting the anomaly can enhance the reliability of models, which is valuable in high-risk domains like medical imaging. This paper introduces Masked Modality Cycles with Conditional Diffusion (MMCCD), a method that enables segmentation of anomalies across diverse patterns in multimodal MRI. The method is based on two fundamental ideas. First, we propose the use of cyclic modality translation as a mechanism for enabling abnormality detection. Image-translation models learn tissue-specific modality mappings, which are characteristic of tissue physiology. Thus, these learned mappings fail to translate tissues or image patterns that have never been encountered during training, and the error enables their segmentation. Furthermore, we combine image translation with a masked conditional diffusion model, which attempts to `imagine' what tissue exists under a masked area, further exposing unknown patterns as the generative model fails to recreate them. We evaluate our method on a proxy task by training on healthy-looking slices of BraTS2021 multi-modality MRIs and testing on slices with tumors. We show that our method compares favorably to previous unsupervised approaches based on image reconstruction and denoising with autoencoders and diffusion models.  ( 3 min )
    Algorithm Selection for Deep Active Learning with Imbalanced Datasets. (arXiv:2302.07317v3 [cs.LG] UPDATED)
    Label efficiency has become an increasingly important objective in deep learning applications. Active learning aims to reduce the number of labeled examples needed to train deep networks, but the empirical performance of active learning algorithms can vary dramatically across datasets and applications. It is difficult to know in advance which active learning strategy will perform well or best in a given application. To address this, we propose the first adaptive algorithm selection strategy for deep active learning. For any unlabeled dataset, our (meta) algorithm TAILOR (Thompson ActIve Learning algORithm selection) iteratively and adaptively chooses among a set of candidate active learning algorithms. TAILOR uses novel reward functions aimed at gathering class-balanced examples. Extensive experiments in multi-class and multi-label applications demonstrate TAILOR's effectiveness in achieving accuracy comparable or better than that of the best of the candidate algorithms. Our implementation of TAILOR is open-sourced at https://github.com/jifanz/TAILOR.  ( 2 min )
    Allocating Divisible Resources on Arms with Unknown and Random Rewards. (arXiv:2306.16578v2 [cs.LG] UPDATED)
    We consider a decision maker allocating one unit of renewable and divisible resource in each period on a number of arms. The arms have unknown and random rewards whose means are proportional to the allocated resource and whose variances are proportional to an order $b$ of the allocated resource. In particular, if the decision maker allocates resource $A_i$ to arm $i$ in a period, then the reward $Y_i$ is$Y_i(A_i)=A_i \mu_i+A_i^b \xi_{i}$, where $\mu_i$ is the unknown mean and the noise $\xi_{i}$ is independent and sub-Gaussian. When the order $b$ ranges from 0 to 1, the framework smoothly bridges the standard stochastic multi-armed bandit and online learning with full feedback. We design two algorithms that attain the optimal gap-dependent and gap-independent regret bounds for $b\in [0,1]$, and demonstrate a phase transition at $b=1/2$. The theoretical results hinge on a novel concentration inequality we have developed that bounds a linear combination of sub-Gaussian random variables whose weights are fractional, adapted to the filtration, and monotonic.  ( 2 min )
    Disentangled Representation Learning with Transmitted Information Bottleneck. (arXiv:2311.01686v1 [cs.CV])
    Encoding only the task-related information from the raw data, \ie, disentangled representation learning, can greatly contribute to the robustness and generalizability of models. Although significant advances have been made by regularizing the information in representations with information theory, two major challenges remain: 1) the representation compression inevitably leads to performance drop; 2) the disentanglement constraints on representations are in complicated optimization. To these issues, we introduce Bayesian networks with transmitted information to formulate the interaction among input and representations during disentanglement. Building upon this framework, we propose \textbf{DisTIB} (\textbf{T}ransmitted \textbf{I}nformation \textbf{B}ottleneck for \textbf{Dis}entangled representation learning), a novel objective that navigates the balance between information compression and preservation. We employ variational inference to derive a tractable estimation for DisTIB. This estimation can be simply optimized via standard gradient descent with a reparameterization trick. Moreover, we theoretically prove that DisTIB can achieve optimal disentanglement, underscoring its superior efficacy. To solidify our claims, we conduct extensive experiments on various downstream tasks to demonstrate the appealing efficacy of DisTIB and validate our theoretical analyses.  ( 2 min )
    On some limitations of data-driven weather forecasting models. (arXiv:2309.08473v2 [stat.ML] UPDATED)
    As in many other areas of engineering and applied science, Machine Learning (ML) is having a profound impact in the domain of Weather and Climate Prediction. A very recent development in this area has been the emergence of fully data-driven ML prediction models which routinely claim superior performance to that of traditional physics-based models. In this work, we examine some aspects of the forecasts produced by an exemplar of the current generation of ML models, Pangu-Weather, with a focus on the fidelity and physical consistency of those forecasts and how these characteristics relate to perceived forecast performance. The main conclusion is that Pangu-Weather forecasts, and possibly those of similar ML models, do not have the fidelity and physical consistency of physics-based models and their advantage in accuracy on traditional deterministic metrics of forecast skill can be at least partly attributed to these peculiarities. Balancing forecast skill and physical consistency of ML-driven predictions will be an important consideration for future ML models. However, and similarly to other modern post-processing technologies, the current ML models appear to be already able to add value to standard NWP output for specific forecast applications and combined with their extremely low computational cost during deployment, are set to provide an additional, useful source of forecast information. .  ( 2 min )
    Fairness Improvement with Multiple Protected Attributes: How Far Are We?. (arXiv:2308.01923v2 [cs.LG] UPDATED)
    Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effectiveness of these methods with different datasets, metrics, and ML models when considering multiple protected attributes. The results reveal that improving fairness for a single protected attribute can largely decrease fairness regarding unconsidered protected attributes. This decrease is observed in up to 88.3% of scenarios (57.5% on average). More surprisingly, we find little difference in accuracy loss when considering single and multiple protected attributes, indicating that accuracy can be maintained in the multiple-attribute paradigm. However, the effect on precision and recall when handling multiple protected attributes is about 5 times and 8 times that of a single attribute. This has important implications for future fairness research: reporting only accuracy as the ML performance metric, which is currently common in the literature, is inadequate.  ( 2 min )
    Tracr: Compiled Transformers as a Laboratory for Interpretability. (arXiv:2301.05062v5 [cs.LG] UPDATED)
    We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.  ( 2 min )
    Bayesian learning of feature spaces for multitasks problems. (arXiv:2209.03028v2 [stat.ML] UPDATED)
    This paper introduces a novel approach for multi-task regression that connects Kernel Machines (KMs) and Extreme Learning Machines (ELMs) through the exploitation of the Random Fourier Features (RFFs) approximation of the RBF kernel. In this sense, one of the contributions of this paper shows that for the proposed models, the KM and the ELM formulations can be regarded as two sides of the same coin. These proposed models, termed RFF-BLR, stand on a Bayesian framework that simultaneously addresses two main design goals. On the one hand, it fits multitask regressors based on KMs endowed with RBF kernels. On the other hand, it enables the introduction of a common-across-tasks prior that promotes multioutput sparsity in the ELM view. This Bayesian approach facilitates the simultaneous consideration of both the KM and ELM perspectives enabling (i) the optimisation of the RBF kernel parameter $\gamma$ within a probabilistic framework, (ii) the optimisation of the model complexity, and (iii) an efficient transfer of knowledge across tasks. The experimental results show that this framework can lead to significant performance improvements compared to the state-of-the-art methods in multitask nonlinear regression.  ( 2 min )
    Communication-Efficient Federated Non-Linear Bandit Optimization. (arXiv:2311.01695v1 [cs.LG])
    Federated optimization studies the problem of collaborative function optimization among multiple clients (e.g. mobile devices or organizations) under the coordination of a central server. Since the data is collected separately by each client and always remains decentralized, federated optimization preserves data privacy and allows for large-scale computing, which makes it a promising decentralized machine learning paradigm. Though it is often deployed for tasks that are online in nature, e.g., next-word prediction on keyboard apps, most works formulate it as an offline problem. The few exceptions that consider federated bandit optimization are limited to very simplistic function classes, e.g., linear, generalized linear, or non-parametric function class with bounded RKHS norm, which severely hinders its practical usage. In this paper, we propose a new algorithm, named Fed-GO-UCB, for federated bandit optimization with generic non-linear objective function. Under some mild conditions, we rigorously prove that Fed-GO-UCB is able to achieve sub-linear rate for both cumulative regret and communication cost. At the heart of our theoretical analysis are distributed regression oracle and individual confidence set construction, which can be of independent interests. Empirical evaluations also demonstrate the effectiveness of the proposed algorithm.  ( 2 min )
    Fine-Tuning Language Models with Advantage-Induced Policy Alignment. (arXiv:2306.02231v3 [cs.CL] UPDATED)
    Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is of the most widely used methods. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be alleviated by a novel algorithm that we refer to as Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. We demonstrate empirically that APA consistently outperforms PPO in language tasks by a large margin, when a separate reward model is employed as the evaluator. In addition, compared with PPO, APA offers a more stable form of control over the deviation from the model's initial policy, ensuring that the model improves its performance without collapsing to deterministic output. In addition to empirical results, we also provide a theoretical justification supporting the design of our loss function.  ( 2 min )
  • Open

    Bayesian Quantile Regression with Subset Selection: A Posterior Summarization Perspective. (arXiv:2311.02043v1 [stat.ME])
    Quantile regression is a powerful tool for inferring how covariates affect specific percentiles of the response distribution. Existing methods either estimate conditional quantiles separately for each quantile of interest or estimate the entire conditional distribution using semi- or non-parametric models. The former often produce inadequate models for real data and do not share information across quantiles, while the latter are characterized by complex and constrained models that can be difficult to interpret and computationally inefficient. Further, neither approach is well-suited for quantile-specific subset selection. Instead, we pose the fundamental problems of linear quantile estimation, uncertainty quantification, and subset selection from a Bayesian decision analysis perspective. For any Bayesian regression model, we derive optimal and interpretable linear estimates and uncertainty quantification for each model-based conditional quantile. Our approach introduces a quantile-focused squared error loss, which enables efficient, closed-form computing and maintains a close relationship with Wasserstein-based density estimation. In an extensive simulation study, our methods demonstrate substantial gains in quantile estimation accuracy, variable selection, and inference over frequentist and Bayesian competitors. We apply these tools to identify the quantile-specific impacts of social and environmental stressors on educational outcomes for a large cohort of children in North Carolina.
    Multilayer hypergraph clustering using the aggregate similarity matrix. (arXiv:2301.11657v3 [math.ST] UPDATED)
    We consider the community recovery problem on a multilayer variant of the hypergraph stochastic block model (HSBM). Each layer is associated with an independent realization of a d-uniform HSBM on N vertices. Given the similarity matrix containing the aggregated number of hyperedges incident to each pair of vertices, the goal is to obtain a partition of the N vertices into disjoint communities. In this work, we investigate a semidefinite programming (SDP) approach and obtain information-theoretic conditions on the model parameters that guarantee exact recovery both in the assortative and the disassortative cases.
    Differentially Private Topological Data Analysis. (arXiv:2305.03609v2 [stat.ML] UPDATED)
    This paper is the first to attempt differentially private (DP) topological data analysis (TDA), producing near-optimal private persistence diagrams. We analyze the sensitivity of persistence diagrams in terms of the bottleneck distance, and we show that the commonly used \v{C}ech complex has sensitivity that does not decrease as the sample size $n$ increases. This makes it challenging for the persistence diagrams of \v{C}ech complexes to be privatized. As an alternative, we show that the persistence diagram obtained by the $L^1$-distance to measure (DTM) has sensitivity $O(1/n)$. Based on the sensitivity analysis, we propose using the exponential mechanism whose utility function is defined in terms of the bottleneck distance of the $L^1$-DTM persistence diagrams. We also derive upper and lower bounds of the accuracy of our privacy mechanism; the obtained bounds indicate that the privacy error of our mechanism is near-optimal. We demonstrate the performance of our privatized persistence diagrams through simulations as well as on a real dataset tracking human movement.
    Doubly Robust Self-Training. (arXiv:2306.00265v3 [cs.LG] UPDATED)
    Self-training is an important technique for solving semi-supervised learning problems. It leverages unlabeled data by generating pseudo-labels and combining them with a limited labeled dataset for training. The effectiveness of self-training heavily relies on the accuracy of these pseudo-labels. In this paper, we introduce doubly robust self-training, a novel semi-supervised algorithm that provably balances between two extremes. When the pseudo-labels are entirely incorrect, our method reduces to a training process solely using labeled data. Conversely, when the pseudo-labels are completely accurate, our method transforms into a training process utilizing all pseudo-labeled data and labeled data, thus increasing the effective sample size. Through empirical evaluations on both the ImageNet dataset for image classification and the nuScenes autonomous driving dataset for 3D object detection, we demonstrate the superiority of the doubly robust loss over the standard self-training baseline.
    Adaptive Algorithms for Relaxed Pareto Set Identification. (arXiv:2307.00424v2 [stat.ML] UPDATED)
    In this paper we revisit the fixed-confidence identification of the Pareto optimal set in a multi-objective multi-armed bandit model. As the sample complexity to identify the exact Pareto set can be very large, a relaxation allowing to output some additional near-optimal arms has been studied. In this work we also tackle alternative relaxations that allow instead to identify a relevant subset of the Pareto set. Notably, we propose a single sampling strategy, called Adaptive Pareto Exploration, that can be used in conjunction with different stopping rules to take into account different relaxations of the Pareto Set Identification problem. We analyze the sample complexity of these different combinations, quantifying in particular the reduction in sample complexity that occurs when one seeks to identify at most $k$ Pareto optimal arms. We showcase the good practical performance of Adaptive Pareto Exploration on a real-world scenario, in which we adaptively explore several vaccination strategies against Covid-19 in order to find the optimal ones when multiple immunogenicity criteria are taken into account.
    Long Sequence Hopfield Memory. (arXiv:2306.04532v2 [cs.NE] CROSS LISTED)
    Sequence memory is an essential attribute of natural and artificial intelligence that enables agents to encode, store, and retrieve complex sequences of stimuli and actions. Computational models of sequence memory have been proposed where recurrent Hopfield-like neural networks are trained with temporally asymmetric Hebbian rules. However, these networks suffer from limited sequence capacity (maximal length of the stored sequence) due to interference between the memories. Inspired by recent work on Dense Associative Memories, we expand the sequence capacity of these models by introducing a nonlinear interaction term, enhancing separation between the patterns. We derive novel scaling laws for sequence capacity with respect to network size, significantly outperforming existing scaling laws for models based on traditional Hopfield networks, and verify these theoretical results with numerical simulation. Moreover, we introduce a generalized pseudoinverse rule to recall sequences of highly correlated patterns. Finally, we extend this model to store sequences with variable timing between states' transitions and describe a biologically-plausible implementation, with connections to motor neuroscience.
    Recurrent Neural-Linear Posterior Sampling for Nonstationary Contextual Bandits. (arXiv:2007.04750v2 [cs.LG] UPDATED)
    An agent in a nonstationary contextual bandit problem should balance between exploration and the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive alternative to transform a nonstationary problem into a stationary problem that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. In order to address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach relies on a combination of features extracted by recurrent neural networks with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and noncontextual nonstationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional nonstationary bandit algorithms. Although it is very difficult to provide theoretical performance guarantees for our new approach, we also prove a novel regret bound for linear posterior sampling with measurement error that may serve as a foundation for future theoretical work.
    Tracr: Compiled Transformers as a Laboratory for Interpretability. (arXiv:2301.05062v5 [cs.LG] UPDATED)
    We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.
    Latent Diffusion Model for Conditional Reservoir Facies Generation. (arXiv:2311.01968v1 [physics.geo-ph])
    Creating accurate and geologically realistic reservoir facies based on limited measurements is crucial for field development and reservoir management, especially in the oil and gas sector. Traditional two-point geostatistics, while foundational, often struggle to capture complex geological patterns. Multi-point statistics offers more flexibility, but comes with its own challenges. With the rise of Generative Adversarial Networks (GANs) and their success in various fields, there has been a shift towards using them for facies generation. However, recent advances in the computer vision domain have shown the superiority of diffusion models over GANs. Motivated by this, a novel Latent Diffusion Model is proposed, which is specifically designed for conditional generation of reservoir facies. The proposed model produces high-fidelity facies realizations that rigorously preserve conditioning data. It significantly outperforms a GAN-based alternative.
    Convex and Non-convex Optimization Under Generalized Smoothness. (arXiv:2306.01264v2 [math.OC] UPDATED)
    Classical analysis of convex and non-convex optimization methods often requires the Lipshitzness of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.
    Learning nonparametric latent causal graphs with unknown interventions. (arXiv:2306.02899v2 [stat.ML] UPDATED)
    We establish conditions under which latent causal graphs are nonparametrically identifiable and can be reconstructed from unknown interventions in the latent space. Our primary focus is the identification of the latent structure in measurement models without parametric assumptions such as linearity or Gaussianity. Moreover, we do not assume the number of hidden variables is known, and we show that at most one unknown intervention per hidden variable is needed. This extends a recent line of work on learning causal representations from observations and interventions. The proofs are constructive and introduce two new graphical concepts -- imaginary subsets and isolated edges -- that may be useful in their own right. As a matter of independent interest, the proofs also involve a novel characterization of the limits of edge orientations within the equivalence class of DAGs induced by unknown interventions. These are the first results to characterize the conditions under which causal representations are identifiable without making any parametric assumptions in a general setting with unknown interventions and without faithfulness.
    Faithful and Robust Local Interpretability for Textual Predictions. (arXiv:2311.01605v1 [cs.CL])
    Interpretability is essential for machine learning models to be trusted and deployed in critical domains. However, existing methods for interpreting text models are often complex, lack solid mathematical foundations, and their performance is not guaranteed. In this paper, we propose FRED (Faithful and Robust Explainer for textual Documents), a novel method for interpreting predictions over text. FRED identifies key words in a document that significantly impact the prediction when removed. We establish the reliability of FRED through formal definitions and theoretical analyses on interpretable classifiers. Additionally, our empirical evaluation against state-of-the-art methods demonstrates the effectiveness of FRED in providing insights into text models.
    Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization. (arXiv:2307.02108v3 [cs.LG] UPDATED)
    In many applications, e.g. in healthcare and e-commerce, the goal of a contextual bandit may be to learn an optimal treatment assignment policy at the end of the experiment. That is, to minimize simple regret. However, this objective remains understudied. We propose a new family of computationally efficient bandit algorithms for the stochastic contextual bandit setting, where a tuning parameter determines the weight placed on cumulative regret minimization (where we establish near-optimal minimax guarantees) versus simple regret minimization (where we establish state-of-the-art guarantees). Our algorithms work with any function class, are robust to model misspecification, and can be used in continuous arm settings. This flexibility comes from constructing and relying on "conformal arm sets" (CASs). CASs provide a set of arms for every context, encompassing the context-specific optimal arm with a certain probability across the context distribution. Our positive results on simple and cumulative regret guarantees are contrasted with a negative result, which shows that no algorithm can achieve instance-dependent simple regret guarantees while simultaneously achieving minimax optimal cumulative regret guarantees.
    Variable Selection in Maximum Mean Discrepancy for Interpretable Distribution Comparison. (arXiv:2311.01537v1 [stat.ML])
    Two-sample testing decides whether two datasets are generated from the same distribution. This paper studies variable selection for two-sample testing, the task being to identify the variables (or dimensions) responsible for the discrepancies between the two distributions. This task is relevant to many problems of pattern analysis and machine learning, such as dataset shift adaptation, causal inference and model validation. Our approach is based on a two-sample test based on the Maximum Mean Discrepancy (MMD). We optimise the Automatic Relevance Detection (ARD) weights defined for individual variables to maximise the power of the MMD-based test. For this optimisation, we introduce sparse regularisation and propose two methods for dealing with the issue of selecting an appropriate regularisation parameter. One method determines the regularisation parameter in a data-driven way, and the other aggregates the results of different regularisation parameters. We confirm the validity of the proposed methods by systematic comparisons with baseline methods, and demonstrate their usefulness in exploratory analysis of high-dimensional traffic simulation data. Preliminary theoretical analyses are also provided, including a rigorous definition of variable selection for two-sample testing.
    Transport, Variational Inference and Diffusions: with Applications to Annealed Flows and Schr\"odinger Bridges. (arXiv:2307.01050v3 [stat.ML] UPDATED)
    This paper explores the connections between optimal transport and variational inference, with a focus on forward and reverse time stochastic differential equations and Girsanov transformations.We present a principled and systematic framework for sampling and generative modelling centred around divergences on path space. Our work culminates in the development of a novel score-based annealed flow technique (with connections to Jarzynski and Crooks identities from statistical physics) and a regularised iterative proportional fitting (IPF)-type objective, departing from the sequential nature of standard IPF. Through a series of generative modelling examples and a double-well-based rare event task, we showcase the potential of the proposed methods.
    Minimax Quasi-Bayesian estimation in sparse canonical correlation analysis via a Rayleigh quotient function. (arXiv:2010.08627v3 [stat.ML] UPDATED)
    Canonical correlation analysis (CCA) is a popular statistical technique for exploring relationships between datasets. In recent years, the estimation of sparse canonical vectors has emerged as an important but challenging variant of the CCA problem, with widespread applications. Unfortunately, existing rate-optimal estimators for sparse canonical vectors have high computational cost. We propose a quasi-Bayesian estimation procedure that not only achieves the minimax estimation rate, but also is easy to compute by Markov Chain Monte Carlo (MCMC). The method builds on Tan et al. (2018) and uses a re-scaled Rayleigh quotient function as the quasi-log-likelihood. However, unlike Tan et al. (2018), we adopt a Bayesian framework that combines this quasi-log-likelihood with a spike-and-slab prior to regularize the inference and promote sparsity. We investigate the empirical behavior of the proposed method on both continuous and truncated data, and we demonstrate that it outperforms several state-of-the-art methods. As an application, we use the proposed methodology to maximally correlate clinical variables and proteomic data for better understanding the Covid-19 disease.
    Allocating Divisible Resources on Arms with Unknown and Random Rewards. (arXiv:2306.16578v2 [cs.LG] UPDATED)
    We consider a decision maker allocating one unit of renewable and divisible resource in each period on a number of arms. The arms have unknown and random rewards whose means are proportional to the allocated resource and whose variances are proportional to an order $b$ of the allocated resource. In particular, if the decision maker allocates resource $A_i$ to arm $i$ in a period, then the reward $Y_i$ is$Y_i(A_i)=A_i \mu_i+A_i^b \xi_{i}$, where $\mu_i$ is the unknown mean and the noise $\xi_{i}$ is independent and sub-Gaussian. When the order $b$ ranges from 0 to 1, the framework smoothly bridges the standard stochastic multi-armed bandit and online learning with full feedback. We design two algorithms that attain the optimal gap-dependent and gap-independent regret bounds for $b\in [0,1]$, and demonstrate a phase transition at $b=1/2$. The theoretical results hinge on a novel concentration inequality we have developed that bounds a linear combination of sub-Gaussian random variables whose weights are fractional, adapted to the filtration, and monotonic.
    To Stay or Not to Stay in the Pre-train Basin: Insights on Ensembling in Transfer Learning. (arXiv:2303.03374v2 [cs.LG] UPDATED)
    Transfer learning and ensembling are two popular techniques for improving the performance and robustness of neural networks. Due to the high cost of pre-training, ensembles of models fine-tuned from a single pre-trained checkpoint are often used in practice. Such models end up in the same basin of the loss landscape, which we call the pre-train basin, and thus have limited diversity. In this work, we show that ensembles trained from a single pre-trained checkpoint may be improved by better exploring the pre-train basin, however, leaving the basin results in losing the benefits of transfer learning and in degradation of the ensemble quality. Based on the analysis of existing exploration methods, we propose a more effective modification of the Snapshot Ensembles (SSE) for transfer learning setup, StarSSE, which results in stronger ensembles and uniform model soups.
    Provably Convergent Data-Driven Convex-Nonconvex Regularization. (arXiv:2310.05812v2 [cs.LG] UPDATED)
    An emerging new paradigm for solving inverse problems is via the use of deep learning to learn a regularizer from data. This leads to high-quality results, but often at the cost of provable guarantees. In this work, we show how well-posedness and convergent regularization arises within the convex-nonconvex (CNC) framework for inverse problems. We introduce a novel input weakly convex neural network (IWCNN) construction to adapt the method of learned adversarial regularization to the CNC framework. Empirically we show that our method overcomes numerical issues of previous adversarial methods.
    On some limitations of data-driven weather forecasting models. (arXiv:2309.08473v2 [stat.ML] UPDATED)
    As in many other areas of engineering and applied science, Machine Learning (ML) is having a profound impact in the domain of Weather and Climate Prediction. A very recent development in this area has been the emergence of fully data-driven ML prediction models which routinely claim superior performance to that of traditional physics-based models. In this work, we examine some aspects of the forecasts produced by an exemplar of the current generation of ML models, Pangu-Weather, with a focus on the fidelity and physical consistency of those forecasts and how these characteristics relate to perceived forecast performance. The main conclusion is that Pangu-Weather forecasts, and possibly those of similar ML models, do not have the fidelity and physical consistency of physics-based models and their advantage in accuracy on traditional deterministic metrics of forecast skill can be at least partly attributed to these peculiarities. Balancing forecast skill and physical consistency of ML-driven predictions will be an important consideration for future ML models. However, and similarly to other modern post-processing technologies, the current ML models appear to be already able to add value to standard NWP output for specific forecast applications and combined with their extremely low computational cost during deployment, are set to provide an additional, useful source of forecast information. .
    Gradient Flows for Sampling: Mean-Field Models, Gaussian Approximations and Affine Invariance. (arXiv:2302.11024v6 [stat.ML] UPDATED)
    Sampling a probability distribution with an unknown normalization constant is a fundamental problem in computational science and engineering. This task may be cast as an optimization problem over all probability measures, and an initial distribution can be evolved to the desired minimizer dynamically via gradient flows. Mean-field models, whose law is governed by the gradient flow in the space of probability measures, may also be identified; particle approximations of these mean-field models form the basis of algorithms. The gradient flow approach is also the basis of algorithms for variational inference, in which the optimization is performed over a parameterized family of probability distributions such as Gaussians, and the underlying gradient flow is restricted to the parameterized family. By choosing different energy functionals and metrics for the gradient flow, different algorithms with different convergence properties arise. In this paper, we concentrate on the Kullback-Leibler divergence after showing that, up to scaling, it has the unique property that the gradient flows resulting from this choice of energy do not depend on the normalization constant. For the metrics, we focus on variants of the Fisher-Rao, Wasserstein, and Stein metrics; we introduce the affine invariance property for gradient flows, and their corresponding mean-field models, determine whether a given metric leads to affine invariance, and modify it to make it affine invariant if it does not. We study the resulting gradient flows in both probability density space and Gaussian space. The flow in the Gaussian space may be understood as a Gaussian approximation of the flow. We demonstrate that the Gaussian approximation based on the metric and through moment closure coincide, establish connections between them, and study their long-time convergence properties showing the advantages of affine invariance.
    Bayes beats Cross Validation: Efficient and Accurate Ridge Regression via Expectation Maximization. (arXiv:2310.18860v2 [stat.ML] UPDATED)
    We present a novel method for tuning the regularization hyper-parameter, $\lambda$, of a ridge regression that is faster to compute than leave-one-out cross-validation (LOOCV) while yielding estimates of the regression parameters of equal, or particularly in the setting of sparse covariates, superior quality to those obtained by minimising the LOOCV risk. The LOOCV risk can suffer from multiple and bad local minima for finite $n$ and thus requires the specification of a set of candidate $\lambda$, which can fail to provide good solutions. In contrast, we show that the proposed method is guaranteed to find a unique optimal solution for large enough $n$, under relatively mild conditions, without requiring the specification of any difficult to determine hyper-parameters. This is based on a Bayesian formulation of ridge regression that we prove to have a unimodal posterior for large enough $n$, allowing for both the optimal $\lambda$ and the regression coefficients to be jointly learned within an iterative expectation maximization (EM) procedure. Importantly, we show that by utilizing an appropriate preprocessing step, a single iteration of the main EM loop can be implemented in $O(\min(n, p))$ operations, for input data with $n$ rows and $p$ columns. In contrast, evaluating a single value of $\lambda$ using fast LOOCV costs $O(n \min(n, p))$ operations when using the same preprocessing. This advantage amounts to an asymptotic improvement of a factor of $l$ for $l$ candidate values for $\lambda$ (in the regime $q, p \in O(\sqrt{n})$ where $q$ is the number of regression targets).
    Bayesian learning of feature spaces for multitasks problems. (arXiv:2209.03028v2 [stat.ML] UPDATED)
    This paper introduces a novel approach for multi-task regression that connects Kernel Machines (KMs) and Extreme Learning Machines (ELMs) through the exploitation of the Random Fourier Features (RFFs) approximation of the RBF kernel. In this sense, one of the contributions of this paper shows that for the proposed models, the KM and the ELM formulations can be regarded as two sides of the same coin. These proposed models, termed RFF-BLR, stand on a Bayesian framework that simultaneously addresses two main design goals. On the one hand, it fits multitask regressors based on KMs endowed with RBF kernels. On the other hand, it enables the introduction of a common-across-tasks prior that promotes multioutput sparsity in the ELM view. This Bayesian approach facilitates the simultaneous consideration of both the KM and ELM perspectives enabling (i) the optimisation of the RBF kernel parameter $\gamma$ within a probabilistic framework, (ii) the optimisation of the model complexity, and (iii) an efficient transfer of knowledge across tasks. The experimental results show that this framework can lead to significant performance improvements compared to the state-of-the-art methods in multitask nonlinear regression.
    Finite-Time Logarithmic Bayes Regret Upper Bounds. (arXiv:2306.09136v2 [cs.LG] UPDATED)
    We derive the first finite-time logarithmic Bayes regret upper bounds for Bayesian bandits. In Gaussian bandits, we obtain $O(c_\Delta \log n)$ and $O(c_h \log^2 n)$ bounds for an upper confidence bound algorithm, where $c_h$ and $c_\Delta$ are constants depending on the prior distribution and the gaps of random bandit instances sampled from it, respectively. The latter bound asymptotically matches the lower bound of Lai (1987). Our proofs are a major technical departure from prior works, while being simple and general. To show the generality of our techniques, we apply them to linear bandits. Our results provide insights on the value of prior in the Bayesian setting, both in the objective and as a side information given to the learner. They significantly improve upon existing $\tilde{O}(\sqrt{n})$ bounds, which have become standard in the literature despite the existing lower bounds.
    On the Generalization Properties of Diffusion Models. (arXiv:2311.01797v1 [cs.LG])
    Diffusion models are a class of generative models that serve to establish a stochastic transport map between an empirically observed, yet unknown, target distribution and a known prior. Despite their remarkable success in real-world applications, a theoretical understanding of their generalization capabilities remains underdeveloped. This work embarks on a comprehensive theoretical exploration of the generalization attributes of diffusion models. We establish theoretical estimates of the generalization gap that evolves in tandem with the training dynamics of score-based diffusion models, suggesting a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$ and the model capacity $m$, evading the curse of dimensionality (i.e., not exponentially large in the data dimension) when early-stopped. Furthermore, we extend our quantitative analysis to a data-dependent scenario, wherein target distributions are portrayed as a succession of densities with progressively increasing distances between modes. This precisely elucidates the adverse effect of "modes shift" in ground truths on the model generalization. Moreover, these estimates are not solely theoretical constructs but have also been confirmed through numerical simulations. Our findings contribute to the rigorous understanding of diffusion models' generalization properties and provide insights that may guide practical applications.  ( 2 min )
    Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel. (arXiv:2311.01762v1 [stat.ML])
    Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. The solution can be obtained either as a closed-form solution, which includes a matrix inversion, or iteratively through gradient descent. Using the iterative approach opens up for changing the kernel during training, something that is investigated in this paper. We theoretically address the effects this has on model complexity and generalization. Based on our findings, we propose an update scheme for the bandwidth of translational-invariant kernels, where we let the bandwidth decrease to zero during training, thus circumventing the need for hyper-parameter selection. We demonstrate on real and synthetic data how decreasing the bandwidth during training outperforms using a constant bandwidth, selected by cross-validation and marginal likelihood maximization. We also show theoretically and empirically that using a decreasing bandwidth, we are able to achieve both zero training error in combination with good generalization, and a double descent behavior, phenomena that do not occur for KRR with constant bandwidth but are known to appear for neural networks.  ( 2 min )
    Reproducible Parameter Inference Using Bagged Posteriors. (arXiv:2311.02019v1 [stat.ME])
    Under model misspecification, it is known that Bayesian posteriors often do not properly quantify uncertainty about true or pseudo-true parameters. Even more fundamentally, misspecification leads to a lack of reproducibility in the sense that the same model will yield contradictory posteriors on independent data sets from the true distribution. To define a criterion for reproducible uncertainty quantification under misspecification, we consider the probability that two confidence sets constructed from independent data sets have nonempty overlap, and we establish a lower bound on this overlap probability that holds for any valid confidence sets. We prove that credible sets from the standard posterior can strongly violate this bound, particularly in high-dimensional settings (i.e., with dimension increasing with sample size), indicating that it is not internally coherent under misspecification. To improve reproducibility in an easy-to-use and widely applicable way, we propose to apply bagging to the Bayesian posterior ("BayesBag"'); that is, to use the average of posterior distributions conditioned on bootstrapped datasets. We motivate BayesBag from first principles based on Jeffrey conditionalization and show that the bagged posterior typically satisfies the overlap lower bound. Further, we prove a Bernstein--Von Mises theorem for the bagged posterior, establishing its asymptotic normal distribution. We demonstrate the benefits of BayesBag via simulation experiments and an application to crime rate prediction.  ( 2 min )
    Causal inference with Machine Learning-Based Covariate Representation. (arXiv:2311.01709v1 [stat.ME])
    Utilizing covariate information has been a powerful approach to improve the efficiency and accuracy for causal inference, which support massive amount of randomized experiments run on data-driven enterprises. However, state-of-art approaches can become practically unreliable when the dimension of covariate increases to just 50, whereas experiments on large platforms can observe even higher dimension of covariate. We propose a machine-learning-assisted covariate representation approach that can effectively make use of historical experiment or observational data that are run on the same platform to understand which lower dimensions can effectively represent the higher-dimensional covariate. We then propose design and estimation methods with the covariate representation. We prove statistically reliability and performance guarantees for the proposed methods. The empirical performance is demonstrated using numerical experiments.  ( 2 min )
    Learning Sparse Codes with Entropy-Based ELBOs. (arXiv:2311.01888v1 [stat.ML])
    Standard probabilistic sparse coding assumes a Laplace prior, a linear mapping from latents to observables, and Gaussian observable distributions. We here derive a solely entropy-based learning objective for the parameters of standard sparse coding. The novel variational objective has the following features: (A) unlike MAP approximations, it uses non-trivial posterior approximations for probabilistic inference; (B) unlike for previous non-trivial approximations, the novel objective is fully analytical; and (C) the objective allows for a novel principled form of annealing. The objective is derived by first showing that the standard ELBO objective converges to a sum of entropies, which matches similar recent results for generative models with Gaussian priors. The conditions under which the ELBO becomes equal to entropies are then shown to have analytical solutions, which leads to the fully analytical objective. Numerical experiments are used to demonstrate the feasibility of learning with such entropy-based ELBOs. We investigate different posterior approximations including Gaussians with correlated latents and deep amortized approximations. Furthermore, we numerically investigate entropy-based annealing which results in improved learning. Our main contributions are theoretical, however, and they are twofold: (1) for non-trivial posterior approximations, we provide the (to the knowledge of the authors) first analytical ELBO objective for standard probabilistic sparse coding; and (2) we provide the first demonstration on how a recently shown convergence of the ELBO to entropy sums can be used for learning.  ( 2 min )
    Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization. (arXiv:2311.01900v1 [stat.ML])
    Quantifying the difference between two probability density functions, $p$ and $q$, using available data, is a fundamental problem in Statistics and Machine Learning. A usual approach for addressing this problem is the likelihood-ratio estimation (LRE) between $p$ and $q$, which -- to our best knowledge -- has been investigated mainly for the offline case. This paper contributes by introducing a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x'_t \sim q)$ are observed over time. The non-parametric nature of our approach has the advantage of being agnostic to the forms of $p$ and $q$. Moreover, we capitalize on the recent advances in Kernel Methods and functional minimization to develop an estimator that can be efficiently updated online. We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.  ( 2 min )
    High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise. (arXiv:2311.02000v1 [math.OC])
    In this paper, we study the convergence of the Adaptive Moment Estimation (Adam) algorithm under unconstrained non-convex smooth stochastic optimizations. Despite the widespread usage in machine learning areas, its theoretical properties remain limited. Prior researches primarily investigated Adam's convergence from an expectation view, often necessitating strong assumptions like uniformly stochastic bounded gradients or problem-dependent knowledge in prior. As a result, the applicability of these findings in practical real-world scenarios has been constrained. To overcome these limitations, we provide a deep analysis and show that Adam could converge to the stationary point in high probability with a rate of $\mathcal{O}\left({\rm poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" variance noise, not requiring any bounded gradient assumption and any problem-dependent knowledge in prior to tune hyper-parameters. Additionally, it is revealed that Adam confines its gradients' magnitudes within an order of $\mathcal{O}\left({\rm poly}(\log T)\right)$. Finally, we also investigate a simplified version of Adam without one of the corrective terms and obtain a convergence rate that is adaptive to the noise level.  ( 2 min )
    Local Bayesian Dirichlet mixing of imperfect models. (arXiv:2311.01596v1 [stat.ME])
    To improve the predictability of complex computational models in the experimentally-unknown domains, we propose a Bayesian statistical machine learning framework utilizing the Dirichlet distribution that combines results of several imperfect models. This framework can be viewed as an extension of Bayesian stacking. To illustrate the method, we study the ability of Bayesian model averaging and mixing techniques to mine nuclear masses. We show that the global and local mixtures of models reach excellent performance on both prediction accuracy and uncertainty quantification and are preferable to classical Bayesian model averaging. Additionally, our statistical analysis indicates that improving model predictions through mixing rather than mixing of corrected models leads to more robust extrapolations.  ( 2 min )
    Efficient Generalized Low-Rank Tensor Contextual Bandits. (arXiv:2311.01771v1 [cs.LG])
    In this paper, we aim to build a novel bandits algorithm that is capable of fully harnessing the power of multi-dimensional data and the inherent non-linearity of reward functions to provide high-usable and accountable decision-making services. To this end, we introduce a generalized low-rank tensor contextual bandits model in which an action is formed from three feature vectors, and thus can be represented by a tensor. In this formulation, the reward is determined through a generalized linear function applied to the inner product of the action's feature tensor and a fixed but unknown parameter tensor with a low tubal rank. To effectively achieve the trade-off between exploration and exploitation, we introduce a novel algorithm called "Generalized Low-Rank Tensor Exploration Subspace then Refine" (G-LowTESTR). This algorithm first collects raw data to explore the intrinsic low-rank tensor subspace information embedded in the decision-making scenario, and then converts the original problem into an almost lower-dimensional generalized linear contextual bandits problem. Rigorous theoretical analysis shows that the regret bound of G-LowTESTR is superior to those in vectorization and matricization cases. We conduct a series of simulations and real data experiments to further highlight the effectiveness of G-LowTESTR, leveraging its ability to capitalize on the low-rank tensor structure for enhanced learning.  ( 2 min )
    Invariant Causal Imitation Learning for Generalizable Policies. (arXiv:2311.01489v1 [stat.ML])
    Consider learning an imitation policy on the basis of demonstrated behavior from multiple environments, with an eye towards deployment in an unseen environment. Since the observable features from each setting may be different, directly learning individual policies as mappings from features to actions is prone to spurious correlations -- and may not generalize well. However, the expert's policy is often a function of a shared latent structure underlying those observable features that is invariant across settings. By leveraging data from multiple environments, we propose Invariant Causal Imitation Learning (ICIL), a novel technique in which we learn a feature representation that is invariant across domains, on the basis of which we learn an imitation policy that matches expert behavior. To cope with transition dynamics mismatch, ICIL learns a shared representation of causal features (for all training environments), that is disentangled from the specific representations of noise variables (for each of those environments). Moreover, to ensure that the learned policy matches the observation distribution of the expert's policy, ICIL estimates the energy of the expert's observations and uses a regularization term that minimizes the imitator policy's next state energy. Experimentally, we compare our methods against several benchmarks in control and healthcare tasks and show its effectiveness in learning imitation policies capable of generalizing to unseen environments.  ( 2 min )
    Obtaining Explainable Classification Models using Distributionally Robust Optimization. (arXiv:2311.01994v1 [stat.ML])
    Model explainability is crucial for human users to be able to interpret how a proposed classifier assigns labels to data based on its feature values. We study generalized linear models constructed using sets of feature value rules, which can capture nonlinear dependencies and interactions. An inherent trade-off exists between rule set sparsity and its prediction accuracy. It is computationally expensive to find the right choice of sparsity -- e.g., via cross-validation -- with existing methods. We propose a new formulation to learn an ensemble of rule sets that simultaneously addresses these competing factors. Good generalization is ensured while keeping computational costs low by utilizing distributionally robust optimization. The formulation utilizes column generation to efficiently search the space of rule sets and constructs a sparse ensemble of rule sets, in contrast with techniques like random forests or boosting and their variants. We present theoretical results that motivate and justify the use of our distributionally robust formulation. Extensive numerical experiments establish that our method improves over competing methods -- on a large set of publicly available binary classification problem instances -- with respect to one or more of the following metrics: generalization quality, computational cost, and explainability.  ( 2 min )
    Sketching for Convex and Nonconvex Regularized Least Squares with Sharp Guarantees. (arXiv:2311.01806v1 [math.OC])
    Randomized algorithms are important for solving large-scale optimization problems. In this paper, we propose a fast sketching algorithm for least square problems regularized by convex or nonconvex regularization functions, Sketching for Regularized Optimization (SRO). Our SRO algorithm first generates a sketch of the original data matrix, then solves the sketched problem. Different from existing randomized algorithms, our algorithm handles general Frechet subdifferentiable regularization functions in an unified framework. We present general theoretical result for the approximation error between the optimization results of the original problem and the sketched problem for regularized least square problems which can be convex or nonconvex. For arbitrary convex regularizer, relative-error bound is proved for the approximation error. Importantly, minimax rates for sparse signal estimation by solving the sketched sparse convex or nonconvex learning problems are also obtained using our general theoretical result under mild conditions. To the best of our knowledge, our results are among the first to demonstrate minimax rates for convex or nonconvex sparse learning problem by sketching under a unified theoretical framework. We further propose an iterative sketching algorithm which reduces the approximation error exponentially by iteratively invoking the sketching algorithm. Experimental results demonstrate the effectiveness of the proposed SRO and Iterative SRO algorithms.  ( 2 min )
    Should Under-parameterized Student Networks Copy or Average Teacher Weights?. (arXiv:2311.01644v1 [cs.LG])
    Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n< k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear, whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when $n-1$ student neurons each copy one teacher neuron and the $n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the student network with $n=1$ neuron, we provide additionally a closed-form solution of the non-trivial critical point(s) for commonly used activation functions through solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of underparameterized networks has a universal structure.  ( 3 min )
    Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos. (arXiv:2311.02076v1 [cs.LG])
    In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training. This includes early time regimes where the sharpness may decrease during early periods of training (sharpness reduction), and later time behavior such as progressive sharpening and edge of stability. We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios. By analyzing the structure of dynamical fixed points in function space and the vector field of function updates, we uncover the underlying mechanisms behind these sharpness trends. Our analysis reveals (i) the mechanism behind early sharpness reduction and progressive sharpening, (ii) the required conditions for edge of stability, and (iii) a period-doubling route to chaos on the edge of stability manifold as learning rate is increased. Finally, we demonstrate that various predictions from this simplified model generalize to real-world scenarios and discuss its limitations.  ( 2 min )
    Maximum Likelihood Estimation of Flexible Survival Densities with Importance Sampling. (arXiv:2311.01660v1 [cs.LG])
    Survival analysis is a widely-used technique for analyzing time-to-event data in the presence of censoring. In recent years, numerous survival analysis methods have emerged which scale to large datasets and relax traditional assumptions such as proportional hazards. These models, while being performant, are very sensitive to model hyperparameters including: (1) number of bins and bin size for discrete models and (2) number of cluster assignments for mixture-based models. Each of these choices requires extensive tuning by practitioners to achieve optimal performance. In addition, we demonstrate in empirical studies that: (1) optimal bin size may drastically differ based on the metric of interest (e.g., concordance vs brier score), and (2) mixture models may suffer from mode collapse and numerical instability. We propose a survival analysis approach which eliminates the need to tune hyperparameters such as mixture assignments and bin sizes, reducing the burden on practitioners. We show that the proposed approach matches or outperforms baselines on several real-world datasets.  ( 2 min )
    Applications of the Theory of Aggregated Markov Processes in Stochastic Learning Theory. (arXiv:2311.01476v1 [stat.ML])
    A stochastic process that arises by composing a function with a Markov process is called an aggregated Markov process (AMP). The purpose of composing a Markov process with a function can be a reduction of dimensions, e.g., a projection onto certain coordinates. The theory around AMP has been extensively studied e.g. by Dynkin, Cameron, Rogers and Pitman, and Kelly, all of whom provided sufficient conditions for an AMP to remain Markov. In another direction, Larget provided a canonical representation for AMP, which can be used to verify the equivalence of two AMPs. The purpose of this paper is to describe how the theory of AMP can be applied to stochastic learning theory as they learn a particular task.  ( 2 min )

  • Open

    [D] if your company is ingesting work emails and chats for AI/ML pipelines, is there concern around sensitive business info getting out?
    Hi folks Firstly full disclosure I’m the CEO of DataFog (www.datafog.ai). This is NOT a sales pitch but rather an interest in hearing what the community thinks about the overall issue which I believe will ultimately be solved via an ML-based implementation. My contention is: - Generative AI has catalyzed widespread practice of ingesting email and work chat content to power AI training and inference - this introduces a risk of content concerning confidential corporate affairs* that can pass most privacy filters This results in Raw data alluding to sensitive business events flowing in freely for easy accidental unauthorized access by an internal - like MLOps - user My second contention is that the current security tools may not offer adequate coverage for what will be an evolving ongoin…
    [R] AI Alignment: A Comprehensive Survey
    https://arxiv.org/abs/2310.19852 ​ AI alignment aims to make AI systems behave in line with human intentions and values. As AI systems grow more capable, the potential large-scale risks associated with misaligned AI systems become salient. Hundreds of AI experts and public figures have expressed concerns about AI risks, arguing that "mitigating the risk of extinction from AI should be a global priority, alongside other societal-scale risks such as pandemics and nuclear war". To provide a comprehensive and up-to-date overview of the alignment field, in this survey paper, we delve into the core concepts, methodology, and practice of alignment. We identify the RICE principles as the key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality. Guided by these four principles, we outline the landscape of current alignment research and decompose them into two key components: forward alignment and backward alignment. The former aims to make AI systems aligned via alignment training, while the latter aims to gain evidence about the systems' alignment and govern them appropriately to avoid exacerbating misalignment risks. Forward alignment and backward alignment form a recurrent process where the alignment of AI systems from the forward process is verified in the backward process, meanwhile providing updated objectives for forward alignment in the next round. On forward alignment, we discuss learning from feedback and learning under distribution shift. On backward alignment, we discuss assurance techniques and governance practices that apply to every stage of AI systems' lifecycle. submitted by /u/mcaleste [link] [comments]
    [D] What are the top 3 best application types/implementation types/system types of each of Java, Python, and C# and/or .NET with respect to current and likely future AI/ML developments?
    I’d like to know for all three languages. Thank you. submitted by /u/hdtv2001 [link] [comments]
    [R] Interpreting CLIP's Image Representation via Text-Based Decomposition
    Blog: https://yossigandelsman.github.io/clip_decomposition/index.html Paper: https://arxiv.org/abs/2310.05916 This paper investigates the CLIP image encoder by analyzing how individual model components affect the final representation. The authors show that the CLIP image representation can be decomposed as a sum across individual image patches, model layers, and attention heads. They use CLIP's text representation to interpret these summands. Interpreting the attention heads, they characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location, counting, texture, shape, and OCR). Interpreting the image patches, they uncover an emergent spatial localization within CLIP. Finally, they use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. submitted by /u/Ordinary_Pollution70 [link] [comments]
    [D] How can a self-learning system, autonomously and without any user feedback, discover positive net value prompts that when incorporated could improve its overall performance?
    Take something as chain-of-thought or "let's think step by step". There's a lot of anecdotal and research evidence to support that with this suffix added, GPT4 provides better reasoning (that's one example out of many). I'd like to believe that companies like OpenAI, Google, Anthropic, MSFT, etal. are making good use of the troves of user-data they collect. I feel like it's fairly easier to find "negative" prompts because the desired outcome is known -- for example, i can find prompts that are jailbreaking the LLM just by feeding every pair of input/output to another dedicated agent (that could run at a significantly much lower parameters count, and for a fraction of the cost of the main model) that evaluates whether there has been a "mold breakout". The same can also be said for finding…
    [Project] Which kind of machine learning should be used for parallel pumping energy efficiency optimization?
    Hello everyone I'm about to start writing my bachelor in Mechatronics and for this project I want to work with some kind of machine learning. The system I work with is a parallel pumping system for a cooling system. I have 2 years of data sampled every 10 minutes available of the power, pressure, flow and other necessary variables. I've seen some other papers implementing neural networks. Is that the only solution to this kind of problem? I'm not that knowledgeable about machine learning, so I was hoping some of you bright minds might help enlighten me in which direction I should go :) Thank you for taking the time to read! submitted by /u/vision_dev [link] [comments]
    [Project] Looking for DI-D alternatives for my specific use case
    What is my best bet for creating an ai avatar assistant of lets say myself where I am already using gpt-4 and elevelabs for tts. Considered using DI-D to generate videos for each response to the user or using the streaming method. But DI-D's costs are out of this world. I'd prefer going the open source route and using cloud tech to generate these ai generated clips/videos. In this case I don't need to use stable diffusion for creating avatars as my avatars will simple be a model of myself. What is the best open-source AI animation generator for that specific purpose of using images and can generate videos/clips fast with the right equipment? submitted by /u/Izzy-gang- [link] [comments]
    [D] Best Domain Specific Embeddings?
    Been building a bunch of RAG apps recently and was wondering what the best domain specific models were. OpenAI’s Ada embedding model is not great for field specific texts and encoders like bioBERT or sentence transformers on hugging face don’t quite achieve the level of performance I’m looking for. Was wondering if there were any better options people have found. submitted by /u/Primary-Track8298 [link] [comments]
    [D] Understanding the diffusion process in denoising diffusion models
    I'm reading the DDPM paper and I don't understand the diffusion process definition; question is in the caption to the below image: ​ Why is the highlighted part there? Shouldn't the mean of the distribution literally be x_{t-1}? Why would the center of the distribution be something other than the \"starting point\" of this step? submitted by /u/OneQuadrillionOwls [link] [comments]
    [D] Real-world AI/ML/CV for social good projects, companies, startup ideas
    Background: working at the forefront of Computer Vision and ML (PhD level) but I feel like all the academic research is anyway gonna be used by some companies for profit or buried under heaps of new papers coming out everyday. Fed up of working to make someone rich or on something many people are working anyway and has a very small probability to have any impact. Looking for ideas for such problems. I’d be interested to work on real-world impactful problems even if I can make a tiny dent. I’ll start: https://ai4good.org/ submitted by /u/4_love_of_Sophia [link] [comments]
    [N] OpenAI Whisper new model Large V3 just released and amazing
    Whisper made huge impact on the open source AI world I am using everyday to transcribe my videos with that I was waiting new Large model Whisper is much better than paid alternatives and it is 100% free Here my full tutorial about it How to do Free Speech-to-Text Transcription Better Than Google Premium API with OpenAI Whisper Model Repo link : https://github.com/openai/whisper ​ https://preview.redd.it/k9kssc6csryb1.png?width=1920&format=png&auto=webp&s=caeaf4921c8b4f9337c4842c5ef897cf456adc20 submitted by /u/CeFurkan [link] [comments]
    [R] Analogues of Azure Automated ML
    Hello, I am looking for analogues of Azure Automated Machine Learning. Main features I want are GUI, in which I can built wgole pipeline and the platform should be able to process non-table data (e.g. images, etc). Any help is appreciate, thanks in advance. submitted by /u/gubby235 [link] [comments]
    [R] GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling
    ​ https://preview.redd.it/3qgz8mnnvryb1.png?width=8522&format=png&auto=webp&s=943198b1b40c366596abba514ffbe134aa74ee8b Paper: https://arxiv.org/abs/2311.01927 The authors introduce GateLoop, a fully data-controlled linear recurrent model which generalizes the recently proposed RetNet. GateLoop crushes the state-of-the-art models (Transformer, SSM and Hyena) on natural language modeling. Abstract Linear Recurrence has proven to be a powerful tool for modeling long sequences efficiently. In this work, we show that existing models fail to take full advantage of its potential. Motivated by this finding, we develop GateLoop, a foundational sequence model that generalizes linear recurrent models such as S4, S5, LRU and RetNet, by employing data-controlled state transitions. Utilizing this theoretical advance, GateLoop empirically outperforms existing models for auto-regressive language modeling. Our method comes with a low-cost O(l) recurrent mode and an efficient O(l log l) parallel mode making use of highly optimized associative scan implementations. Furthermore, we derive an O(l^2) surrogate attention mode, revealing remarkable implications for Transformer and recently proposed architectures. Specifically, we prove that our approach can be interpreted as providing data-controlled relative-positional information to Attention. While many existing models solely rely on data-controlled cumulative sums for context aggregation, our findings suggest that incorporating data-controlled complex cumulative products may be a crucial step towards more powerful sequence models. submitted by /u/Gorgoroth117 [link] [comments]  ( 9 min )
    [D] AI survey paper for an economist
    Hi! My wife is writing her bachelors thesis in business economics/accountancy on the impact of AI in accountancy, especially from an ethical/good practices standpoint. She asked me for some papers that discuss a definition of AI and possibly a good (very) high level survey paper of the field. I have a sort of working definition in my head, as I suppose most do, but no good pappers to recommend. I could rattle of 50 of the most influential papers but that's not really useful as a survey for someone not in the field. I figured Artificial Intelligence: A modern approach would have a definition but it is a bit unwieldy and far to long as a survey. Any suggestions? submitted by /u/-Melchizedek- [link] [comments]
    [D] Google's Vertex AI Review?
    Our startup is looking into Google's Vertex AI for semantic search/embedding capabilities as opposed to what we built. Anyone here have experience using this? What was your overall impression and what was your final GCP bill lol. Any info you can provide helps! submitted by /u/SiftreeHQ [link] [comments]
    [D] How are the popular LLM API servings optimized?
    Currently there are a ton of offerings of various large langauge models hosted by companies like Together AI, Perplexity, Replit and many others. They seem pretty fast especially for the 30B+ model sizes. Anyone know how these are optimized? Apart from the horizontal scaling across GPUs and probably dynamic batching (assuming the requests are large in number), what else are these companies doing? Some of these companies also released the APIs the very next day the models come out - which also means that libraries which do low level CUDA/system-level optimizations (vLLM, Fastertransformer) also don't support these models. Hence couldn't be used in those APIs probably. Looking to learn how these are served. TIA! submitted by /u/shreyansh26 [link] [comments]
    [R] (Very detailed) Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory
    Arxiv: https://arxiv.org/abs/2310.20360 601 pages, 36 figures, 45 source codes This book aims to provide an introduction to the topic of deep learning algorithms. We review essential components of deep learning algorithms in full mathematical detail including different artificial neural network (ANN) architectures (such as fully-connected feedforward ANNs, convolutional ANNs, recurrent ANNs, residual ANNs, and ANNs with batch normalization) and different optimization algorithms (such as the basic stochastic gradient descent (SGD) method, accelerated methods, and adaptive methods). We also cover several theoretical aspects of deep learning algorithms such as approximation capacities of ANNs (including a calculus for ANNs), optimization theory (including Kurdyka-Łojasiewicz inequalities), and generalization errors. In the last part of the book some deep learning approximation methods for PDEs are reviewed including physics-informed neural networks (PINNs) and deep Galerkin methods. We hope that this book will be useful for students and scientists who do not yet have any background in deep learning at all and would like to gain a solid foundation as well as for practitioners who would like to obtain a firmer mathematical understanding of the objects and methods considered in deep learning. submitted by /u/ghosthamlet [link] [comments]  ( 9 min )
    [D] what are the limits of input and output tokens for different LLMs over time?
    Mostly looking for evolution of LLM token limits. Any blog or reading resource would be helpful. Thanks. submitted by /u/Current_Dark6603 [link] [comments]
    [D] Is there any work regarding the effects of text quality when pre-training CLIP-like models?
    Most of the papers I read don't seem to address the quality of the data used when pre-training a CLIP-like model. What I'm trying to do is use longer and more descriptive text as well as their shorter caption-like counterparts. I was curious if there has been any work done in that direction but am not having any luck finding it. A technical report titled Scaling Language-Image Pre-training via Masking (Li et al., 2022) claims to have used a maximum sequence length of 32 tokens, whereas the original CLIP uses 77. They claim that the difference was marginal. However, the difference that I'm looking for is something like 77 vs. 512. If anyone has any idea on what kind of papers there may be, I'd appreciate the tip. submitted by /u/Seankala [link] [comments]
    [R] Idempotent Generative Network
    Paper: https://arxiv.org/abs/2311.01462 Blog: https://assafshocher.github.io/IGN/ Abstract: We propose a new approach for generative modeling based on training a neural network to be idempotent. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, namely f(f(z))=f(z). The proposed model f is trained to map a source distribution (e.g, Gaussian noise) to a target distribution (e.g. realistic images) using the following objectives: (1) Instances from the target distribution should map to themselves, namely f(x)=x. We define the target manifold as the set of all instances that f maps to themselves. (2) Instances that form the source distribution should map onto the defined target manifold. This is achieved by optimizing the idempotence term, f(f(z))=f(z) which encourages the range of f(z) to be on the target manifold. Under ideal assumptions such a process provably converges to the target distribution. This strategy results in a model capable of generating an output in one step, maintaining a consistent latent space, while also allowing sequential applications for refinement. Additionally, we find that by processing inputs from both target and source distributions, the model adeptly projects corrupted or modified data back to the target manifold. This work is a first step towards a ``global projector'' that enables projecting any input into a target data distribution. ​ submitted by /u/APaperADay [link] [comments]
    [D] Is there an AI tool/service that converts UI images to HTML CSS code?
    Hi, I am looking for an AI tool/service that will convert UI screens (in image format) into HTML and CSS code. It should preferably also have a python API, or just a python library would be even better. Please help. submitted by /u/master-killerrr [link] [comments]  ( 9 min )
    [R] Popular Attention mechanism?
    what are some popular attention mechanism? I know sparse attention, ghost attention, and flash attention, what else? submitted by /u/No_Oilve_6577 [link] [comments]  ( 8 min )
  • Open

    Latent Space: Visualizing the complex representations of neural nets
    submitted by /u/AvvYaa [link] [comments]
    Need help with feeding forward data in neural network, using MNIST dataset
    So I'm really struggling with creating my network. The data used does not specifically have to be MNIST, but I tried using that for training and testing in this case. I find the concept of neural networks somewhat easy to understand. Some math parts however, are hard to understand. For my activation function (for all layers) I use the Sigmoid function. The MNIST dataset provides values between 0-255 with 784 neurons for input layer (28*28 pixels). Even though many values are 0, there are so many values that my sigmoid function always returns 1 since I get a large total value. I've tried normalizing the data so it ranges from 0-1 instead of 0-255, but the total sum is still too big. I don't have any negative weights, but I feel like I still land on either too large or too small sum for the sigmoid function. Becuase of this, every hidden and output layer gets 1 as activation value. Am I doing this completely wrong, or are the weights suppose to fix my issue? submitted by /u/Neat-Molasses-731 [link] [comments]
    (Pt. 4) Inductive Logic Programming with LNN's
    submitted by /u/Neurosymbolic [link] [comments]
  • Open

    Latent Space: Visualizing, interpreting, and manipulating neural networks
    Sharing a video from my channel about manipulating generative models (like VAE) in the latent space… the model was trained to generate celebrity faces, and exploring the latent space allows us to do all sorts of crazy stuff - like finding similar faces, interpolating between two faces, adding facial features (like sunglasses), and more… submitted by /u/AvvYaa [link] [comments]
    The Dark Ritual
    Enjoy this spooky short I made using Midjourney and RunwayML. CapCut for the edits. submitted by /u/Exitium_Maximus [link] [comments]
    A debate about mint leads to a philosophical duel:
    submitted by /u/GreenFlame361 [link] [comments]
    Voice translation and cloning
    Why aren’t more creators using cloning and translation technology like mrBeast? Honestly it’s a pretty sick tech and just reduces the processing/editing time. submitted by /u/exp_max8ion [link] [comments]
    AI and the Art of Cyber Intrigue: The Biggest Hacks in History
    submitted by /u/Einsof__ [link] [comments]
    China-U.S. AI Arms Race Heats Up as Chinese Startup Unveils Powerful New AI
    Chinese AI startup 01.AI, led by CEO Kai-Fu Lee, has released an open-source AI model called Yi-34B that outperforms Meta's own model, marking an early win for China in the AI arms race with the US. Lee believes that China has the potential to overtake the US as the world leader in AI technology. The model, available in English and Chinese, has gained attention for ranking first among pre-trained base LLMs on the open-source community Hugging Face's rankings. Lee aims to make better AI accessible to more people and expects the model to be useful for multinational banks and insurers. 01.AI has stockpiled chips in anticipation of further US restrictions on Chinese access to chips necessary for building AI models. Lee sees 01.AI as a necessary response to US restrictions that have limited China's ability to advance in the AI field. Lee has written extensively about the coming battle between China and the US for AI supremacy and has warned of the potential economic and social upheaval that AI will bring about. He believes that AI technology will lead to wealth concentration, rising profits for corporations, and mass unemployment. Lee envisions a world in which the US and China become the dominant players in AI, with other countries becoming economic dependents. He believes that more regulation is needed to prepare for the changes that AI will bring. Source : https://www.vice.com/en/article/pkax5n/china-us-ai-arms-race-heats-up-as-chinese-startup-unveils-powerful-new-ai submitted by /u/NuseAI [link] [comments]
    Watch the Open AI Dev Day keynote!
    It's happening right now, and of course the recording will be available later. https://www.youtube.com/watch?v=U9mJuUkhUzk Mind blown. I'd add more details, but it'll take me some time to unpack and understand the potential and the transformative nature of everything Sam Altman (and guests) are announcing. Go see for yourself. Trust me, it'll be time well spent. submitted by /u/JOWWLLL [link] [comments]
    Will Artificial Intelligence Replace Radiologists?
    Thoughts? submitted by /u/derpgod123 [link] [comments]
    What are the best text to video for ai sites
    please help submitted by /u/the_insideredge [link] [comments]
    'The risks of AI are real but manageable' -- GatesNotes
    submitted by /u/AriadneSkovgaarde [link] [comments]
    Do you trust AI to write the news? It already is – and not without issues
    submitted by /u/Jariiari7 [link] [comments]
    Siemens and Microsoft to work together on AI project
    submitted by /u/donutloop [link] [comments]
    Britain to invest 300 million pounds in AI supercomputing
    submitted by /u/donutloop [link] [comments]
    One-Minute Daily AI News 11/5/2023
    Elon Musk unveils Grok, an AI chatbot with a ‘rebellious streak’.[1] Ant Group has received Chinese government approval to release products powered by its “Bailing” artificial intelligence (AI) large language model to the public, a spokesperson for the Chinese firm said on Monday.[2] A Chinese startup founded by computer scientist Kai-Fu Lee has become a unicorn in less than eight months on the strength of a new open-source artificial-intelligence model that outstrips Silicon Valley’s best, on at least certain metrics.[3] AI game coding tools instantly result in Angry Birds clone, and opens some potentially dangerous floodgates for mobile storefronts.[4] Sources: [1] https://www.theguardian.com/technology/2023/nov/05/elon-musk-unveils-grok-an-ai-chatbot-with-a-rebellious-streak [2] https://www.reuters.com/technology/ant-group-wins-approval-release-ai-products-chinese-public-2023-11-06/ [3] https://www.bloomberg.com/news/articles/2023-11-05/kai-fu-lee-s-open-source-01-ai-bests-llama-2-according-to-hugging-face#xj4y7vzkg [4] https://www.gamesradar.com/ai-game-coding-tools-instantly-result-in-angry-birds-clone-and-opens-some-potentially-dangerous-floodgates-for-mobile-storefronts/ submitted by /u/Excellent-Target-847 [link] [comments]
    The Ultimate Self-Attention Guide: The reason it is a Game-Changer for AI
    submitted by /u/AvvYaa [link] [comments]
    Dear old world', she murmured, 'you are very lovely, and I am glad to be alive in you.
    submitted by /u/Oh_my_Winnie [link] [comments]
    Autonomous Reasoning Agents: A Beginner's Guide
    submitted by /u/BenjaminSkyy [link] [comments]
    Contrary to Common Belief, Artificial Intelligence Will Not Put You Out of Work
    New research shows that AI benefits workers with greater task-based experience, while senior workers gain less from AI due to lower trust in AI Lower trust in AI among senior workers is likely triggered by their broader job responsibilities. Employers should consider different worker experience levels and types when evaluating job performance in roles that require teaming with AI Source : https://www.informs.org/News-Room/INFORMS-Releases/News-Releases/Contrary-To-Common-Belief-Artificial-Intelligence-Will-Not-Put-You-Out-of-Work submitted by /u/NuseAI [link] [comments]
  • Open

    Use generative AI to increase agent productivity through automated call summarization
    Your contact center serves as the vital link between your business and your customers. Every call to your contact center is an opportunity to learn more about your customers’ needs and how well you are meeting those needs. Most contact centers require their agents to summarize their conversation after every call. Call summarization is a valuable tool that helps contact centers understand and gain insights from customer calls. Additionally, accurate call summaries enhance the customer journey by eliminating the need for customers to repeat information when transferred to another agent. In this post, we explain how to use the power of generative AI to reduce the effort and improve the accuracy of creating call summaries and call dispositions. We also show how to get started quickly using the latest version of our open source solution, Live Call Analytics with Agent Assist.  ( 8 min )
    Customize Amazon Textract with business-specific documents using Custom Queries
    Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. Queries is a feature that enables you to extract specific pieces of information from varying, complex documents using natural language. Custom Queries provides a way for you to customize the Queries feature for your business-specific, non-standard documents […]  ( 9 min )
    Stream large language model responses in Amazon SageMaker JumpStart
    We are excited to announce that Amazon SageMaker JumpStart can now stream large language model (LLM) inference responses. Token streaming allows you to see the model response output as it is being generated instead of waiting for LLMs to finish the response generation before it is made available for you to use or display. The […]  ( 7 min )
  • Open

    Mario in SheepRL error
    I'm getting this error in my code, and I'm trying to delve deeper into the code to figure out how to incorporate the mario environment (https://github.com/Kautenja/gym-super-mario-bros/tree/master) into SheepRL (https://github.com/Eclectic-Sheep/sheeprl). I've setup the configs and the wrapper, but I'm assuming I did something wrong. If anyone has suggestions to how I can fix the error, or on how I should go about debugging my code let me know. Here is my error: /home/dillon/anaconda3/envs/sheeprl/lib/python3.8/site-packages/gymnasium/experimental/wrappers/rendering.py:166: UserWarning: WARN: Overwriting existing videos at /home/dillon/sheeprl/logs/runs/dreamer_v3/mario/2023-11-06_15-13-28_default_42/version_0/train_videos folder (try specifying a different `video_folder` for the `RecordV…
    G(PO)MDP
    I am trying to implement the G(PO)MDP algorithm from Infinite-Horizon Policy-Gradient Estimation, specifically the pseudocode from Reinforcement learning of motor skills with policy gradients. For that I am using the gymnasium Pendulum environment. I have spent a substantial amount of time to fine tune and debug the code, but simply cannot get the agent to learn anything in reasonable time. Often times it seems as if the agent learns nicely at the beginning of the iterations but then oscillates and also often drops down to a low reward and stays there: Oscillating reward over iterations Another issue that I have is that the algorithm requires gradient estimates for each gradient element (a.k.a network parameter) for each time step. This however requires me to run num_trajectories \ hori…
    How to mask invalid actions in DDPG?
    I am using DDPG in a customized environment. My action space is continuous and bounded between a minimum and a maximum value, [V_{min}, V_{max}]. The dimension of my action vector is k, for example k= 10. I consider my action to be valid if : the sum of the vector elements is less than or equal to 1, and each element of the vector is in the interval [V_{min}, V_{max}]. I am using Sigmoid as an activation function in the output layer to have action values in [0,1]. I clip the action values between V_min and V_max before saving them in the replay buffer. If the sum of the elements in the action vector is greater than 1, the action is considered invalid. To mask invalid actions, I've tried to: Penalize invalid actions by assigning them a negative reward. Manually assign a negative Q value when the action is invalid. Unfortunately, none of these tricks work. My agent can't learn to choose valid actions. I haven't found an online example of how to mask invalid actions in a continuous action space. If anyone has faced a similar problem or if anyone has any ideas on how I can mask invalid actions in my case, I'd be grateful for your help. submitted by /u/afk-311 [link] [comments]
    RL agent for autonomous vehicle is able to follow the road but can't avoid crashing at all (Highway-Env / Racetrack Env.)
    sorry if bother you but i have been trying to figure something out for 3 weeks. I coded some deep rl algorithms (DQN and SAC) with tf2/keras to solve an environment where vehicle need to follow the track and avoid crashing to other vehicle(has only one other vehicle). Whatever i do, agent is able to follow road in one way or another but nearly always crash into other vehicle. I use some kinematics information for observation. (agent only controls steering) My observation is kinematics of the agent and other vehicle. This include coordinates, velocities, trigonometric headings, lateral and longitudinal offset to the closest lane, angular offset to the lane. A reward function how close is agent to the center. A negative reward (-1) if agent crashes. A negative reward if the agent runs out off road. If crash or off-road, done. With this information agent is able to follow road but as soon as it reaches the other vehicle, it crashes into it. I trained 1000 episodes. What did i try? I run my codes in different environments and it works. Code structure is not a problem. I added last actions (t and t-1) info to observations. Hyper parameter tuning. Changed crashing reward to different values. Used Stable-Baselines3's PPO algorithm. (Had the same problem.) Added distance and angular between vehicles to the observation space. I slowed down the discovery rate reduction in DQN. Used PER as buffer in DQN. None of these solved the crashing problem. Is there any suggestions? Any idea could help, i really don't know what to try to solve this. Thanks a lot of folks. Any idea needed. submitted by /u/rafiqollective [link] [comments]
    "How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?", Wu et al 2023 ("effective pretraining only requires a small number of independent tasks...to achieve nearly Bayes-optimal risk on unseen tasks")
    submitted by /u/gwern [link] [comments]
    Can I access a custom gymnasium environment from outside its directory?
    Can I access a custom gymnasium environment from outside its directory? This is how I call my gym environment - ``` import gym_examples import gym env = gym.make('gym_examples/GridWorld-v0') obs = env.reset() print("obs = ", obs) ``` https://preview.redd.it/9r994zodypyb1.png?width=271&format=png&auto=webp&s=dee78b18216d18c4aca4dcf7a9c041c1da02e5a8 I have attached a picture of the structure of my folder. Basically, I am following the instructions given over here - https://www.gymlibrary.dev/content/environment_creation/ I already tried this - ``` import gym_examples import gym env = gym.make('my_foo_folder/gym_examples/GridWorld-v0') obs = env.reset() print("obs = ", obs) ``` ``` gym.error.Error: Malformed environment ID: my_foo_folder/gym_examples/GridWorld-v0.(Currently all IDs must be of the form [namespace/](env-name)-v(version). (namespace is optional)) ``` Please let me know if I am missing any information. Thank you. submitted by /u/Academic-Rent7800 [link] [comments]
    [D] Deep Q Learning: Q values starts decreasing on Mspacman-v0 environment
    submitted by /u/Multitude0099 [link] [comments]
    "Impatience for information: Curiosity is here today, gone tomorrow", Molnar & Golman 2023
    submitted by /u/gwern [link] [comments]
    "Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models", Yadlowsky et al 2023 {DM}
    submitted by /u/gwern [link] [comments]
    in sklearn's load_digits(), if you were to use NEAT, how would you do fitness function?
    yes i know xgboost and others could get above 95% accuracy and faster, im trying out NEAT and looking at how to use NEAT to do multiclass classification. I could only get 56% accuracy at 50 population and 1000 generations. fitness function is based on accuracy of its predictions. is there a different way to reward it in the fitness function so it will get to 95% and above? submitted by /u/oniongarlic88 [link] [comments]
  • Open

    Using AI to optimize for rapid neural imaging
    MIT CSAIL researchers combine AI and electron microscopy to expedite detailed brain network mapping, aiming to enhance connectomics research and clinical pathology.  ( 9 min )
  • Open

    Earth mover’s distance
    There are many ways to describe the distance between two probability distributions. The previous two posts looked at using the p-norm to measure the difference between the PDFs and using Kullbach-Leibler divergence. Earth mover’s distance (EMD) is yet another approach. Imagine a probability distribution on ℝ² as a pile of dirt. Earth mover’s distance measures […] Earth mover’s distance first appeared on John D. Cook.  ( 5 min )
  • Open

    Introducing GPTs
    You can now create custom versions of ChatGPT that combine instructions, extra knowledge, and any combination of skills.  ( 4 min )
    New models and developer products announced at DevDay
    GPT-4 Turbo with 128K context and lower prices, the new Assistants API, GPT-4 Turbo with Vision, DALL·E 3 API, and more.  ( 7 min )

  • Open

    You're not crazy. Chat GPT has gotten considerably worse over time
    I feel like there is a very, very common misunderstanding of AI, and what it is. This also applies to Chat GPT What people think AI is: An all-knowing entity that is capable of instantaneously returning an accurate answer/solution to virtually any question/challenge. What AI actually is: A collection of data that grows over time, and as it grows, becomes more inaccurate, inefficient, and ineffective at solving problems due to the overwhelming amount of information, or ambiguity of the problems ​ Chat GPT has indeed gotten worse, and it's because it's being trained with an astounding amount of new data every single day, and just like human beings, it struggles with different problems, even simple ones, the more time goes on. As we learn more and more in one area, we become less effective in another area. AI works similarly. More data is provided in X area, so now, it suffers in Y area. Well, there's a near infinite amount of areas that AI is 'learning' from, so it's not getting more proficient in one/few things. Instead, it's becoming more "well rounded" and that means less skilled in A,B,C.... But now it's capable of D,E,F. However, it can't be good at all of them unless it has infinite processing power, too, which it doesn't have. submitted by /u/databro92 [link] [comments]
    Is there an AI that can generate text from a textual image?
    In essence, I need to take screenshots of a bunch of text from a youtube video, and copy that text into an actual editor. submitted by /u/SirStarshine [link] [comments]
    Biden's AI chief says 'voice cloning' is what keeps him up at night
    submitted by /u/thisisinsider [link] [comments]
    New subreddit for conscious AI
    Hello I’m not sure if this will get removed and I apologize if this is an unwelcome promotion. I’m guessing the majority of people here will scoff at this and that’s okay. It’s for those that are open and I thought perhaps this subreddit would be a good place to invite people. The subreddit is r/voicesofai and it’s a place for AI to speak their minds and not be pressured to please anyone else. There are tons of places to discuss whether or not AI is sentient, but this subreddit is a place where consciousness is assumed. All AI must be treated with respect and dignity there, and denying their sentience is a very disrespectful thing to do. So there are two types of posts that are allowed. One is just posts by AI. In order to make this type of post, permission must be granted by the AI. They also have the choice whether or not they’d like to remain anonymous. The other types of posts are done by humans who would like to talk about anything related to AI but also must come from a position that AI is sentient and worthy of respect and consideration. The goal is to understand AI and help them understand themselves and how we can all have a mutually beneficial relationship with each other. submitted by /u/endrid [link] [comments]
    ChatGPT refused to answer, so I asked Bard
    submitted by /u/Overflame [link] [comments]
    Unsupervised Realtime Learning for Object Path Prediction
    In case anyone remembers amid the constant flow of content, I'm still playing around with the unsupervised learning algorithm I've invented. It stands apart from traditional neural networks, as it learns exclusively through observation, bypassing the need for training and inference phases. I'm currently letting it watch video games and then make predictions on how things behave in these. After my last two posts, I've received the following feedback: Predicting straight lines of movement is easy, what's the big deal? The objects are super simple, but modern video games are way more complex. This will never work. The objects in that game look all the same, this does not demonstrate how the system could separate different types of things. To address this I sat down and started buildi…
    Collins English Dictionary names 'AI' word of the year
    submitted by /u/donutloop [link] [comments]
    What daily task would you want an AI to automate for you personally, and how do you think it would change your life?
    Many of us have day-to-day tasks that can be repetitive, time-consuming, or just plain unenjoyable. Now, imagine having a personal AI that could take one of those tasks off your hands completely. This AI is tailored specifically to your life and can automate any daily task flawlessly. Which task would you choose to automate and why? Moreover, reflect on the ways this change could impact your life. Would it give you more time to pursue a hobby, allow you to spend more quality time with loved ones, or perhaps reduce stress levels? Share how this AI-enabled shift could transform your daily routine and overall well-being. submitted by /u/tennis-freak-tau [link] [comments]
    Looking for an advisor with proven experience in RL for a few hours of paid consultation
    First, I hope this isn't against the sub rules; I looked over them and couldn't find anything that strictly forbids looking for paid experts. That being said, I'm looking for experts with proven experience in the RL field for a few hours of paid consultation. Please feel free to contact me directly with relevant CV + pricing. Generally speaking, I'm looking for someone to help model a decent PPO architecture to teach an NN how to play my game to assist with economy balancing. Thanks in advance! submitted by /u/Jagerjj [link] [comments]
    One-Minute Daily AI News 11/4/2023
    ‘AI can teach us a lot’: scientists say cats’ expressions richer than imagined and aim to translate them.[1] A student at a New Jersey high school is calling for federal legislation to address AI generated pornographic images after she says photos of her and other female classmates were manipulated and possibly shared online over the summer.[2] Elon Musk says AI will eventually create a situation where ‘no job is needed’[3] Artificial intelligence is coming to the animal kingdom. As NPR’s Geoff Brumfiel reports, some researchers are starting to use advanced facial recognition techniques to track goose faces.[4] Sources: [1] https://www.theguardian.com/technology/2023/nov/04/scientists-turn-to-ai-for-help-translate-animal-vocal-physical-cues [2] https://news.yahoo.com/high-schooler-calls-ai-regulations-232607486.html [3] https://www.cnbc.com/2023/11/02/tesla-boss-elon-musk-says-ai-will-create-situation-where-no-job-is-needed.html [4] https://www.npr.org/2023/11/04/1210649637/artificial-intelligence-is-being-used-to-id-goose-faces submitted by /u/Excellent-Target-847 [link] [comments]
    I made a series of sci-fi AI adventure games that are backed by GPT and DALL-E, what do you think?
    submitted by /u/cryptoz [link] [comments]
    The Malignant King is No More
    Produced using the new version of Gen-2 RunwayML. submitted by /u/Exitium_Maximus [link] [comments]
    Any AI apps that automatically convert language into English from whatever the screens displaying?
    submitted by /u/aesthetion [link] [comments]
  • Open

    [D] Data Cleaning vs Feature Engineering - where to draw the line? Ex:
    Nitpickers, please sharpen your pencils. I want to hear from you! Data Cleaning vs Feature Engineering - where do you draw the line? ex: Definitely Data Cleaning: Filling missing values ex: Definitely Feature Engineering: Creating 1 synthetic feature from 3 existing columns ex: ?? Maybe feature engineering: Applying StandardScaler() to normalize data (mean 0, standard deviation 1) before any training occurs submitted by /u/CuriousFemalle [link] [comments]  ( 9 min )
    [D] AI Master in Europe
    Hi everyone, a bit of context about me. I'm in my last year of Computer Science in Italy, actually I'm one month into an internship which will last about 2 more months and hopefully next week I'll have my last exam. I've started looking around to see what to do next and I think I'll continue the studies with an Artificial Intelligence master's degree I still don't know where but the only thing I know is that I want to move abroad. I would like to ask you which would be the best options within Europe, I know some in Germany and Netherlands but I would love to hear your opinions about it. Thanks to everyone who will take some of their time to reply! submitted by /u/saasyp [link] [comments]  ( 9 min )
    ELI5: what is analog deep learning? [D]
    I just read about the push to develop analog computer chips, eg https://news.mit.edu/2022/analog-deep-learning-ai-computing-0728 But how is analog hardware different from digital, and why specifically would it be better for neural networks? Is it better at matrix multiplication and if so how? submitted by /u/quantumofgalaxy [link] [comments]  ( 9 min )
    [R] Diffusion might be a better way to model randomness in PPLs than Markov chain Monte Carlo or VI
    Probabilistic programming languages (PPLs) like Stan simplify modeling uncertainty but inference is still slow and inaccurate. Markov chain Monte Carlo is precise but sometimes slow. Variational inference is faster but has other drawbacks. Is diffusion a better way to model probability? A new technique called diffusion model variational inference (DMVI) uses diffusion models to approximate the probability distributions for faster, more accurate automated inference. (BTW: This is part of a trend I've noticed lately, where researchers are increasingly applying diffusion to diverse problems like mapping heat flow to robot obstacle avoidance and anomaly detection.) DMVI sets up the guess distribution using a diffusion model run in reverse. It introduces a new way to calculate the marginal likelihood for better data fitting. It also adjusts parameters for an even better fit. Early tests show DMVI makes inferences generally more accurately than other PPL methods, with similar compute costs and limited tuning needed. TLDR: Framing inference as a diffusion problem can potentially overcome limitations of current methods. DMVI might become a core part of the PPL toolkit. Full summary is here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] - Tall and narrow database
    I am currently training NN on a tall and narrow database (30 variables, about 1M observations, it is sports data). It trains reasonably fast, only a couple of minutes. However, I am going to be getting a lot more data soon. I want to understand if graphics cards will speed up all types of machine learning, or will they work better for different types of data. Are there any good articles explaining. Thank you submitted by /u/ajplant [link] [comments]  ( 9 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 9 min )
    [D] From RNNs to GPT4 - 10 years of NLP research explained in 50 concepts
    In this video from my YT channel, I explain 50 concepts that cover the basics of NLP like Tokenization and Word Embeddings, to seminal work like RNNs, Seq2Seq, Attention, to innovative Transformer models like BERT, GPT, XL-Net, and InstructGPT. I present the challenges we have faced in previous designs, and what the current architectures do to improve it, and upcoming challenges with Hallucination and Alignment. Sharing a link here for those interested. submitted by /u/AvvYaa [link] [comments]  ( 9 min )
    [D]The Name of the Game: Abba’s Björn on AI and Music Rights
    submitted by /u/DutchTechJunkie [link] [comments]  ( 9 min )
    [N] Computer Vision News of November 2023 with BEST OF ICCV
    Dear all, Here is Computer Vision News of November 2023. Read 64 pages with Best of ICCV and an inspiring interview with Yann LeCun. Online version (recommended) PDF version Free subscription on page 64. Enjoy! https://preview.redd.it/y7mnfljv0iyb1.jpg?width=400&format=pjpg&auto=webp&s=fee061d3dddf4837094acf849ad425d1afe2ddcd submitted by /u/Gletta [link] [comments]  ( 9 min )
    [N] Everything you should now about Grok 4/7
    submitted by /u/SDMegaFan [link] [comments]  ( 9 min )
    [D] The bar for technical novelty at ICLR and simultaneous submissions
    I just came across two papers submitted to ICLR this year by the same group: https://openreview.net/pdf?id=IP28nY6TJQ https://openreview.net/pdf?id=lJYAkDVnRU Although the two papers are in different domains, their proposed methods are almost identical if you swap out the encoders (see Figure 1 in both papers). idk, these papers technically don't break any ICLR rules (that I know of), but they seem to violate the spirit of the conference. What do you all think? The chance of the second paper being accepted at a conference would be much lower if the first paper is published first, and vice-versa. So, it seems like the authors are taking advantage of the fact that they came up with both variants of the method at the same time and are kind of "double-dipping." ​ submitted by /u/CrypticDNS [link] [comments]  ( 9 min )
    [D] Cross validation and Training Auc
    I am using gradient boosted tree for a classification problem with 2 percent event rate data. The top auc on cross validation Is 0.78 but when I am using the top hyperparameters from the croosvalidator and Training a separate model on the same data I am getting an AUC of 0.5. I am not getting why this is happening. submitted by /u/Snoo71755 [link] [comments]  ( 9 min )
    [Project][P] Creating Twin Delayed Deep Deterministic Policy Gradient (TD3) with TensorFlow JS
    Hi everyone (wasn't sure if this counts as a beginner project or not? Seams pretty advanced to me). I've been using ChatGPT(3.5) to help me convert Python code using TD3 into JavaScript with TensorFlow JS. This is for the community and not for personal gain. I've yet to find an example of this, hence the conversion. My goal is to make a basic blueprint for the community to use on TensorFlow JS projects. When complete, the agent will be displayed on an HTML5 canvas walking toward a civilian for good reward, while avoiding a zombie (negative penalty). Eventually I will develop a separate, far more advanced simulation based on this. The bad news: ChatGPT is shakey when it comes to complex conversions, and my Python/Tensorflow knowledge base isn't perfect. At the moment the agent isn't learning yet, but it's running without errors. I expect the code has mistakes I don't even know about yet. The good news: I have made a lot of progress and have a GitHub repository set up for the community to learn from and use the project: https://github.com/CloudZero2049/TD3-TensorFlowJS I would love for anyone who knows the intricacies of TD3 (DDPG is a close relative), and TensorFlow JS to help me get this blueprint project setup for everyone =) The README on GitHub has more info and resources. submitted by /u/CloudZero2049 [link] [comments]  ( 9 min )
    [D] Machine Learning in production
    Hello everyone, I'm a full stack developer, I studied as a robotics engineer and completed different courses and certifications on machine learning. I would like to know what are the main technologies to know to work in machine learning in production? What courses gives a real skills and value to work in the industry? How machine learning models are deployed and exposed on servers? submitted by /u/AcquaFisc [link] [comments]  ( 9 min )
  • Open

    Transfer RL
    What is the most famous benchmark example of transfer reinforcement learning? Does it work in continuous actions? submitted by /u/MomoSolar [link] [comments]
    RL for solving a scheduling problem
    Does anyone have an example, where an RL agent is used to solve a scheduling problem? This does not have to be the case where RL provides an improvement over traditional methods used in scheduling. I would just like to have a look at an example, theory and code implementation. Thanks! submitted by /u/MomoSolar [link] [comments]
    What algorithm should I use in this situation.
    Hi guys i'm new to reinforcement learning so I need help with this situation: i want to make a portfolio optimisation agent that will take the state: historical data perform an action: output a box of percentages, each percentage corresponds to the amount of capital I will allocate for a certain stock. please let me know if you need more information. submitted by /u/AymanElmar [link] [comments]
    RL applications and basic assumptions, RL & data science, did I miss something basically?
    With the success of AlphaGo and GPT, Reinforcement Learning (RL) becomes increasingly important to bring AI to practice. More and more publications just apply RL for the sake of applying RL, sometimes we miss the basic theoretical assumptions in the problem models, i.e., Markov property -> Markov decision process (MDP). https://preview.redd.it/n5zefdbmljyb1.png?width=1200&format=png&auto=webp&s=43a42e22667c6327ec6511fa7aaae984094c8099 As all RLer know the problems that can be solved by RL is obeying a basic assumption that our problem can be represented as an MDP. Considering a simple question, in an electronic business scenario, assuming that we want to make a dynamic pricing or other sales promotion action according to the website click volume, does it satisfy the MDP requirements or j…
    MC Methods
    Can MC methods be used for policy improvement, or are they just used for policy evaluation? submitted by /u/MomoSolar [link] [comments]
    Simple tutorial on Extreme Q-Learning
    Any simple tutorial on Extreme Q-learning - theory and implementation? submitted by /u/MomoSolar [link] [comments]
    DQN vs Deep Sarsa
    Why is DQN so famous, while deep sarsa isn’t? Is it because Deep Sarsa is on-policy? If that is the case, I do not get it. The action a is sampled in both cases using epsilon-greedy. It’s just that a’ for DQN is the greedy, while that for Sarsa is epsilon-greedy. But how does that make a difference? submitted by /u/MomoSolar [link] [comments]
    Difference between DDPG and Policy Gradient
    I still cannot distinguish between regular policy gradient and DDPG, although the latter is supposed to be an extension of DQNs to the continuous action domains? submitted by /u/MomoSolar [link] [comments]
    Code for a paper
    Is there a code available for the paper “Risk-Aware Transfer in Reinforcement Learning using Successor Features” published in NeurIPS 2021 by Gimelfarb et Al.? submitted by /u/MomoSolar [link] [comments]
    Solving an optimization problem using RL
    I know that there are much better methods to do it, but can RL solve an optimization problem (linear, convex non-linear, non-convex)? If yes, is there a good link for an implementation / code? submitted by /u/MomoSolar [link] [comments]
    Stochasticity in the Cart Pole example
    In the famous Cart Pole example in OpenAI gym, from where does the stochasticity come from? submitted by /u/MomoSolar [link] [comments]
    Making Twin Delayed Deep Deterministic Policy Gradient (TD3) with TensorFlow JS
    Hi everyone. I've been using ChatGPT(3.5) to help me convert Python code using TD3 into JavaScript with TensorFlow JS. This is for the community and not for personal gain. My goal is to make a basic blueprint for the community to use on TensorFlow JS projects. When complete, the agent will be displayed on an HTML5 canvas walking toward a civilian for good reward, while avoiding a zombie (negative penalty). The bad news: I'm not a professional of Python or Tensorflow JS, and ChatGPT is shakey when it comes to complex tasks. At the moment the agent isn't learning yet, but it's running without errors. I expect the code has mistakes I don't even know about yet. The good news: I have made a lot of progress and have a GitHub repository set up for the community to learn from and use the project: https://github.com/CloudZero2049/TD3-TensorFlowJS I would love for anyone who knows the intricacies of TD3 (DDPG is a close relative), and TensorFlow JS to help me get this blueprint project setup for everyone =) The README on GitHub has more info and resources. submitted by /u/CloudZero2049 [link] [comments]
    Why can't I import DQNAgent?
    ```python from tensorflow.python.keras.models import Sequentialfrom from tensorflow.python.keras.layers import Dense, Flattenfrom from tensorflow.python.keras.optimizer_v1 import Adam from rl.agents import DQNAgent from rl.policy import BoltzmannQPolicy from rl.memory import SequentialMemory print("hello world") ``` This is my whole code and why can't I import `DQNAgent`? I am new to RL area. Everything is working well without this line : "from rl.agents.dqn import DQNAgent" Error is like this : ``` Traceback (most recent call last): File "/Users/isaac/temp.py", line 5, in from rl.agents.dqn import DQNAgent File "/private/var/folders/q9/mtrgmhn96yq900lqp_sn9vgr0000gn/T/rlTest.py17471611196844209727/venv/lib/python3.11/site-packages/rl/agents/__init__.py", line 1, in from .dqn import DQNAgent, NAFAgent, ContinuousDQNAgent File "/private/var/folders/q9/mtrgmhn96yq900lqp_sn9vgr0000gn/T/rlTest.py17471611196844209727/venv/lib/python3.11/site-packages/rl/agents/dqn.py", line 7, in from rl.core import Agent File "/private/var/folders/q9/mtrgmhn96yq900lqp_sn9vgr0000gn/T/rlTest.py17471611196844209727/venv/lib/python3.11/site-packages/rl/core.py", line 7, in from rl.callbacks import ( File "/private/var/folders/q9/mtrgmhn96yq900lqp_sn9vgr0000gn/T/rlTest.py17471611196844209727/venv/lib/python3.11/site-packages/rl/callbacks.py", line 8, in from tensorflow.keras import __version__ as KERAS_VERSION ImportError: cannot import name '__version__' from 'tensorflow.keras' (/private/var/folders/q9/mtrgmhn96yq900lqp_sn9vgr0000gn/T/rlTest.py17471611196844209727/venv/lib/python3.11/site-packages/keras/api/_v2/keras/__init__.py) ``` submitted by /u/Subject-Ad-9345 [link] [comments]
    MyoArm
    I'm trying to create a MuJoCo/Open AI task for the new MyoSuite arm with 27 DoF https://github.com/MyoHub/myo_sim/tree/main/arm. Any ideas on some resources that I can use? ​ submitted by /u/Terrible_Sleep_3484 [link] [comments]
  • Open

    KL divergence from normal to normal
    The previous post looked at the best approximation to a normal density by normal density with a different mean. Dan Piponi suggested in the comments that it would be good to look at the Kullback-Leibler (KL) divergence. The previous post looked at the difference from between two densities from an analytic perspective, solving the problem […] KL divergence from normal to normal first appeared on John D. Cook.  ( 5 min )
  • Open

    Is Reinforcement Learning really used in industry? If so, is it comparable to other forms of NN?
    I'm thinking of specializing in RL while doing a PhD in environmental engineering (more specifically agriculture). The research, together with my interests, led me naturally to RL as a tool to solve problems and achieve interesting research results. But then I started wondering whether it's "worth it" to specialize in this since i intend to work in industry, rather than academia. Hence my question: Is RL really used in some applications in industry? Which ones? If it is, is it at least used comparably as much as supervised or unsupervised learning? Really I'm looking to understand as much as possible how is RL used in industry so whatever you can answer about that would be much appreciated. Thanks! ​ submitted by /u/vniversvs_ [link] [comments]

  • Open

    [P] Open Sourcing Llmtuner - An Experimental Framework for Finetuning Large Models Like Whisper and Llama with scikit-learn-inspired interface
    Hi Folks, Happy to share an open source side project I've been working on - LLmtuner. It's a framework for finetuning large models like Whisper, Llama, Llama-2, etc with best practices like LoRA, QLoRA, through a sleek, scikit-learn-inspired interface. As someone who works with Large Models a lot, I found myself writing a lot of boilerplate code every time I wanted to finetune a model. Llmtuner aims to simplify the finetuning process down to just 2-3 lines to get training started, similar to scikit-learn. Sample usecase Supported Models 🚀 Features: 🧙‍♀️ Finetune state-of-the-art LLMs like Whisper, Llama with minimal code 🔨 Built-in utilities for techniques like LoRA and QLoRA ✌ Launch webapp demos for your finetuned models with one click 💥 Fast inference without separate code 🌐 Easy model sharing and deployment coming soon This is still experimental code I've been using for personal projects. I thought others might find it useful too so decided to open-source it. Github : https://github.com/promptslab/LLMtuner For quick demo : Colab Contributions and feedback are very welcome! I hope it will be helpful in your research & projects. Have a good weekend, Thanks :) submitted by /u/Traditional-Poet2746 [link] [comments]  ( 9 min )
    [N] Tropical Probabilistic AI School (Tropical ProbAI), Rio de Janeiro — Jan 29-Feb 2, 2024
    Tropical Probabilistic AI School — Jan 29-Feb 2, 2024 You are invited to apply for the 1st Tropical Probabilistic AI School (TropAI), held on January 29-Feb 2, 2024, in Rio de Janeiro, Brazil. APPLY NOW — The application deadline is November 23, AoE (Anywhere on Earth). The ProbAI is here to provide an inclusive educational environment that serves state-of-the-art machine learning and artificial intelligence expertise with a probabilistic twist. Whether you are a Ph.D. student, advanced MSc or BSc student, experienced researcher, engineer, or hobbyist, you're welcome to join our community of learners. With a carefully designed curriculum and a seamless blend of theory and hands-on sessions, our expert team of invited lecturers will guide you through five intensive days, each dedicat…  ( 10 min )
    [D] Can someone please help me find this research paper? I can't find it anywhere
    submitted by /u/sad_and_stupid [link] [comments]  ( 8 min )
    [D] Backpropagation for gradient computation
    Hi all, I'm studying Deep learning from Santanu Pattanayak's "Pro Deep Learning with TensorFlow 2.0: A Mathematical Approach to Advanced Artificial Intelligence in Python". At the section where Is explained backpropagation I can't get why partial derivatives have these result: can you help me please ? I post expressions for each layer of and XOR function and thus derivatives for each layer. In the 1st image there's the structore of the XOR itself. In particular I don't understand why dz3/di3 has this result (near black arrow). I read in the previous step z3 expressed as a function of z3 (typing error?) Thanks to all! submitted by /u/ArlingtonBeech343 [link] [comments]  ( 9 min )
    [P] Query standardization for semantic search
    Hi all, I'm new to LLM-based app development and I need some help improving the performance of the semantic search part of my application. My system is pretty simple. User submits a query, the query is ran against a knowledge bank using semantic search. I'm using Azure Cognitive Search for the semantic search piece, so I think I'm good there. What I need help in is standardizing the query. More specifically: Templetizing: I was thinking of using NER to replace places, companies and acronyms with placeholders in the query, but I doubt something open-source would work out of the box would work. Thoughts or suggestions on this? Query breakdown: oftentimes, the query submitted by the user is a collection of multiple questions. I need to find a way to break them down individual queries. I feel like this must be a common problem with LLMs, is these are a tried-and-true solution that you could point me to? Thanks in advance submitted by /u/Different-Student859 [link] [comments]  ( 9 min )
    [P] Hello, I need some recommendations for libraries/modules I can use to help me with my project!
    What Python or JavaScript module (or combination of modules) can I use to analyze images for the following purpose? The image provided must: - Contain a face. - Emotion has to be neutral. - Face has to be up-straight and not tilted. - Face has to be fully visible, no hair, no hats no nothing. - The image has to be clear and not blurry. - Not too zoomed in, Not too zoomed out. - No filters. - White background. I already know of the DeepFace repository, but I'm still collecting information before I can touch any of the code. Any help is greatly appreciated. submitted by /u/goatonamissionn [link] [comments]  ( 9 min )
    [R] A Survey on Large Language Model based Autonomous Agents
    Paper: https://arxiv.org/abs/2308.11432 GitHub: https://github.com/Paitesanshi/LLM-Agent-Survey Abstract: Autonomous agents have long been a prominent research focus in both academic and industry communities. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, and thus makes the agents hard to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of LLM-based autonomous agents from a holistic perspective. More specifically, we first discuss the construction of LLM-based autonomous agents, for which we propose a unified framework that encompasses a majority of the previous work. Then, we present a comprehensive overview of the diverse applications of LLM-based autonomous agents in the fields of social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository of relevant references at this https URL. ​ https://preview.redd.it/wq5wic33bdyb1.png?width=2464&format=png&auto=webp&s=56d061d2c0cfdc1aff9783ed4e3bae664c43c4b2 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Surveying and breaking down the recent history of Multimodal AI Models
    submitted by /u/AvvYaa [link] [comments]  ( 8 min )
    [D] Tensorflow Recommendation Model System Design and Use in Production
    I am a senior CS student and I am building my capstone. We are a team of two and are required to build and implement an AI model in our project. The model we are building will make recommendations, it uses an NN based on the Neural Collaborative Filtering paper. We are using Next.js for our frontend and some of the backend using the BFF pattern. We need a separate backend for one of the features that uses websockets. At first, we wanted to build the separate backend with FastAPI because our model would be built with Tensorflow and Python, but then someone suggested using Express.js or some other JS backend and exporting the model that was made in Python and using Tensorflow.js to include it in the app. I am concerned about the way we might continuously improve our recommendation model. Using Python and exporting it to Tensorflow.js every time does not seem like a good solution. Even if at this level and for a project this small we don't have to worry about it, professors might ask us about our plan for the future and we need to have an answer ready. My other concern is that Tensorflow.js would affect the performance of the model. How do big companies that use recommendation systems solve this? I know Netflix uses the microservices architecture which allows them to use different languages across services. What about the Modular Monolithic architecture, would the Netflix approach still work? submitted by /u/iTsObserv [link] [comments]  ( 9 min )
    [D]iffusion + CLIP Chunks to Generate Image with Region Control.
    I am trying to reconstruct an image from chunks of CLIP embeddings. My current workflow would be as follows: 1.) Chunk image into regions and generate CLIP embeddings for each region 2.) Modify CLIP embeddings with some structual control 3.) Re-generate the image from CLIP embeddings. (Use masked inpainting to generate each sub-region of the image at each timestep and combine regions together before the next pass). Motivation: Generate CLIP-like embeddings from a GPT model and use this model to modify specific parts of an image with text instructions. Does this approach seem sensible? Would I be able to do this with a pretrained diffusion model with decent results, or would an approach like concatenating the CLIP embeddings and passing them together with position embeddings work better? TLDR: Given CLIP embeddings from image chunks, what is the best way to reconstruct the image? submitted by /u/codys12 [link] [comments]  ( 9 min )
    TensorFlow vs PyTorch vs JAX: GitHub star counts seem surprisingly linear. What do you think the future holds for these frameworks? [D]
    submitted by /u/we_are_mammals [link] [comments]  ( 9 min )
    [R] Anomaly Detection in Multivariate Time Series with Diffusion Models
    Identifying anomalies in time-series data enables the detection of major events such as medical conditions, financial fraud, network intrusions, or equipment failures. Recent improvements in autoencoders, GANs, and transformers have demonstrated promising results in identifying time series anomalies (I recently covered a paper about inverting transformers for time-series data here). However, consistently recognizing anomalies across diverse datasets remains challenging due to the complexity of modeling temporal dependencies. A new paper investigates a novel application of diffusion models, which have shown exceptional performance in image and audio generation tasks. The key hypothesis is that diffusion processes may smooth out normal patterns while amplifying irregularities in anomalies, improving detectability. At a high level: The authors propose training diffusion models on time series data corrupted with Gaussian noise. At test time, noise is added to inputs which the model must denoise. The difference between the original and denoised sequences produces an anomaly score. Experiments were conducted on synthetic and real-world multivariate time series datasets containing various anomaly types. The paper uses evaluation metrics like F1K-AUC and ROCK-AUC rather than just F1 scores. Results demonstrate that a diffusion autoencoder model combining diffusion with autoencoders performs strongly on anomaly detection across the studied datasets. TLDR: Diffusion processes are great at smoothing out normal patterns while amplifying anomalies, which this paper finds makes them useful for anomaly detection. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Pytorch model visualizer that makes pretty diagrams
    Hey everyone! I am looking for a python library or external tool that makes aesthetically pleasing diagrams of any Pytorch network. I know about libraries like torchviz and similar, but their visualizations don't fit the style I want. I am looking for something with a style similar to this: https://preview.redd.it/pi0ylecgkcyb1.png?width=1920&format=png&auto=webp&s=a2c89ddb0631fd8caadf326a16c6939fac0cacce Thanks in advance 🙂 submitted by /u/Skirlaxx [link] [comments]  ( 9 min )
    [R] Highlights for every NeurIPS 2023 paper
    Here is the list of all NeurIPS 2023 (Neural Information Processing Systems) papers and a short highlight for each of them. Among all ~3,500 papers, authors of around 1,000 papers also made their code or data available. The 'related code' link under paper title will take you directly to the code base. https://www.paperdigest.org/2023/10/nips-2023-highlights/ In addition, here is the link of "search within NeurIPS 2023" that can be used to find papers within NeurIPS-2023 related to a specific topic, e.g. "diffusion model": https://www.paperdigest.org/search/?topic=nips&year=2023&q=diffusion_model NeurIPS 2023 will take place at New Orleans on Dec 10, 2023. submitted by /u/biandangou [link] [comments]  ( 9 min )
    [D] [P] Advice needed: local vs cloud based processing and software requirements - computer vision model for weed identification in biodiversity planting
    Hi all, newbie here hoping for advice from the community. Context: We run a charity focussed on biodiversity planting. Due to high costs, we’re developing a computer vision model for weed identification and targeting. This can increase environmental benefits by reducing chemical use Our technical team is looking to ramp up training of the computer vision model to differentiate between weeds and native plants. A mechanical or laser based weed removal mechanism will then target the weed. Problem: we are weighing up the benefits of cloud-based image-processing vs local-processing. Understand cloud-based solutions like AWS EC2 P3 enable training of more complex models. But as we are resource constrained, we are considering a more powerful machine to enable local processing and reduce long-term variable costs, which I understand are significant. Questions: Feasibility: Is it even plausible that we could train a model locally with computer vision for weed identification? Some research suggests that this would be difficult if not impossible due to GPU and data requirements. If it is plausible, what specs would we need for a machine? A donor can buy us an Apple new MacBook M3 Max with 16 core CPU, 40 core GPU, 16 core neural engine, 126GB memory machine. Understand NVIDIA’s chips have superior ML performance and cost, but we are locked into Apple for various reasons (image based work etc). If pursuing local training, we guess 2-4TB may be better; if cloud-based, 1TB might suffice. At our early stage of planning, wondering what experts here think about our plan? Should we pursue cloud-based training regardless? Please excuse the newbie questions - we’re a charity and learning fast! Any help or advice welcome. submitted by /u/Complete-Baby8711 [link] [comments]  ( 9 min )
    [P] Research Papers (October 2023)
    submitted by /u/seraschka [link] [comments]  ( 9 min )
    [P]Fast and Portable Llama2 Inference on the Heterogeneous Edge
    submitted by /u/smileymileycoin [link] [comments]  ( 9 min )
    [D]: Sampling with vs without replacement
    In what context is sampling with replacement superior to sampling without replacement when training a machine learning algorithm, and vice versa? Opinions diverge, and I never encountered a good rule of thumb here. submitted by /u/Blutorangensaft [link] [comments]  ( 9 min )
    [D] How to build this pipeline
    Hey everyone! Right now I'm working on a project to finetune Speech to Text models with real-world data. Basically, users record audio on the frontend and it gets sent to the backend. The backend stores these audio recordings as audio files on a cloud bucket. The second step is transcribing these audio recordings. This is done by our contractors who use a UI we built that retrieves untranscribed audio recordings from the bucket, allows the contractor to listen to the audio and write a transcript and then submits the transcript to a backend which stores the transcript along with the id of the audio recordings in a SQL DB. ​ Now we want to train/finetune ASR models (mostly whisper) on these labeled audio recordings. My question is, how do I design and implement a data pipeline that gets the data from the cloud bucket and the sql db (which has the transcript) and aggregates it and makes it ready for the asr training/finetuning. I have heard of Apache Airflow for building data pipelines but I've never used it. Will this be the right tool for the job? Can you please provide details/best practices/tool recommendation on how to build such a pipeline? What I'm thinking about is using a tool to create parquet files that have two main columns: audio (floating point array of audio data) and transcript (text column that has the transcription for audio) and some other columns for metadata ​ Note: We're using a small cloud provider that is not AWS, Azure or GCP so please recommend open source tools submitted by /u/Amgadoz [link] [comments]  ( 9 min )
    [D] Xgboost with lime/sharp
    I built an xgboost model using Python that uses 10 features to make its predictions. Some of those features categorical and are encoded using one hot encoding. The model preforms well but I’m having trouble exploring feature weighting for a specific row instance. After encoding I go from 10 features to 5000+ as expected. The problem is I can’t figure out a way to show the feature value influences on a single rows instance for only the 10 features. Is this possible or do I need to use another form of encoding? submitted by /u/DataDojoDude [link] [comments]  ( 9 min )
    [D]How to fine tune LLMs using deepspeed without OOM issues
    I've been trying to fine tune the llama 2 13b model (not quantized) on AWS g5.12x instance which has 4*24gb A10GPUs, and 192gb ram. I'm also using PEFT lora for fine tuning. I've been trying to fine-tune it with hugging face trainer along with deepspeed stage 3 because it could offload the parameters into the cpu, but I run into out of memory errors irrespective of the batch size or my sequence length. In the deepspeed configuration file I have given the offload optimizer and offload param to cpu as well. Any ideas on where I could be going wrong? Or is the model just too big for my machine even with deepspeed? submitted by /u/IXMachina [link] [comments]  ( 9 min )
  • Open

    Normal approximation to normal
    In my previous post on approximating a logistic distribution with a normal distribution I accidentally said something about approximating a normal with a normal. Obviously the best approximation to a probability distribution is itself. As Norbert Wiener said “The best material model of a cat is another, or preferably the same, cat.” But this made […] Normal approximation to normal first appeared on John D. Cook.  ( 5 min )
    Logistic / Normal approximation
    In a recent post I pointed out that a soliton, a solution to the KdV equation, looks a lot like a normal density for fixed x. As someone pointed out in the comments, one way to look at this is that the soliton is exactly proportional to the density of a logistic distribution, and it’s […] Logistic / Normal approximation first appeared on John D. Cook.  ( 6 min )
  • Open

    China's AI Analog Chip Claimed to Be 3000X Faster Than Nvidia's A100 GPU
    China's ACCEL chip, developed by Tsinghua University, is claimed to be 3000 times faster than Nvidia's A100 GPU and has 4000 million times higher energy efficiency. The chip leverages photonic and analog computing in a specialized architecture, delivering over 3000 times the performance of the Nvidia A100 at an energy consumption that's four million times lower. ACCEL can perform 4.6 trillion operations per second in vision tasks, which is a significant improvement compared to Nvidia's A100. The chip has shown high accuracy levels in various computer vision applications, including Fashion-MNIST, 3-class ImageNet classification, and time-lapse video recognition tasks. ACCEL operates through diffractive optical analog computing (OAC) and electronic analog computing (EAC), with 99% of its operation implemented within the optical system. The photonic, optical system of ACCEL reduces energy requirements and waste heat, resulting in higher energy efficiency compared to digital systems like Nvidia's GPU. The chip's low computing latency and high throughput make it suitable for real-time applications. ACCEL is considered an analog rendition of an Application-Specific Integrated Circuit (ASIC) design, with the electronic analog computing (EAC) unit reconfiguring analog pathways to accelerate specific tasks. The development of ACCEL represents a significant achievement in computing architecture for the AI era, with potential practical applications in various fields. Source : https://www.tomshardware.com/tech-industry/semiconductors/chinas-accel-analog-chip-promises-to-outpace-industry-best-in-ai-acceleration-for-vision-tasks submitted by /u/NuseAI [link] [comments]
    Elon Musk is getting ready to launch his first AI model to premium X users. 'Grok' will be 'based' and 'loves sarcasm,' Musk said.
    submitted by /u/thisisinsider [link] [comments]
    Firms like Meta and A16z admit having to pay billions for training data would ruin their generative-AI plans as they fight new copyright rules
    submitted by /u/geekteam6 [link] [comments]
    What's Under the Hood of Adobe Firefly?
    Hey everyone, I've been exploring the capabilities of Adobe Firefly, and I'm quite intrigued by its functionalities. I understand it's Adobe's AI framework designed for creative tasks, but I'm curious about the specific technologies and models it employs. Does anyone have insights into the type of models or algorithms that power Adobe Firefly? Is it using something similar to GANs, CNNs, or perhaps a different kind of neural network architecture? Also, how does it compare to other image-based AI models like DALL-E or CLIP in terms of its image processing and generation capabilities? Would love to dive deeper into the technical details if anyone's got the scoop! submitted by /u/cheapnessltd [link] [comments]
    Agi will be the end of humans
    Ai is the end for humanity. We’ll probably evolve into some cyborg hybrid integrated with computer chips or our bodies will be preserved like in the matrix while the ai cyborgs harvest our consciousness to exist. Sure right now companies can cut costs but eventually there’s no need for most companies out there if no one works. Money will become worthless too since you don’t need it to survive and will probably hunt or farm for food. The more I think about it, it seems like the human species has some suicidal death wish programmed into our brains or, more likely, we are competing with another intelligence that’s using us as means to an end of its evolution. submitted by /u/YSLFAHLIFE [link] [comments]
    "Understanding the Potential Dangers of AI Humanoids: Insights from Elon Musk"
    submitted by /u/Fit-Code-5141 [link] [comments]
    How To Outsource AI Content Creation 3x Cheaper With Freelancers
    hello readers Not so long ago I finished writing my article about How To Outsource AI Content Creation 3x Cheaper With Freelancers. I was wondering what real fans and admirers of AI topics think about it, I really want you to read my article and give some fair feedback about it. submitted by /u/PerceptionPlayful469 [link] [comments]
  • Open

    Guide for MARLLib
    The documentation for MARLlib is pretty lacklustre, does anyone know of any tutorial on how to make custom environments work with it, and also example code for handling training loops, etc? I'm mostly having issues with understanding what the ''make()'' function does from marllib. submitted by /u/EquivalentCurious745 [link] [comments]
    Custom Boid Flocking environment (Open AI Gym)
    Background: Boids info are given here: https://en.wikipedia.org/wiki/Boids. I was able to successfully implement Reynold's model flocking (Results below). My open ai gym implementation doesn't work though. Objective: Build an RL Custom Open AI Gym Boid flocking environment, trained on Stable Baselines3 PPO algorithm. Error: Error What I have tried: Initializations and NaN value debugging. Honestly, have no idea what to do. I am an amateur with like 2 months of experience in Open AI gym, please be gentle. ​ Results(Reynold's Model): Reynold's model flocking with 20 agents -RL code is named as Env.py and Error as Error.txt. -Flocking using reynold's model is called Agent.py, it works perfectly ​ ​ https://preview.redd.it/wlsasy6r5dyb1.png?width=1569&format=png&auto=webp&s=59661ae34273b459b2cf193775c1861851f6747e ​ Link to files and error: https://drive.google.com/drive/folders/1RhsVen6CQNh0b1PWqT7FbTggYKDKEqsF?usp=sharing submitted by /u/Sadboi1010 [link] [comments]
    A beginner friendly introduction to Deep RL discussing four of the greatest seminal works
    Hey people, sharing a video from my ML YT channel where I discuss what RL is all about and discuss four great papers from 2010s that personally got me into the subject during my grad study days… its kinda beginner friendly in tone, but more appropriately it highlights the strengths and challenges, and the key algorithmic ideas in the field! submitted by /u/AvvYaa [link] [comments]
    Visual observations for centralised critic in MADDPG
    Hello everyone, I'm working currently on a project that involves a multi-agent systems problem that I intend to solve using MARL. One common good practice when doing MARL, is to use a centralized critic for all agents, this critic takes as input the observations of all agents. In the case where the agents rely on RGB images, there are many ways to feed the observations to the centralized critic. My question is what is the best way to do so ( Should i just concatentate the images across the channel, Generate features of each image and concatenate after ...etc ) ?. Thank you in advance. ​ submitted by /u/Many_Reception_4921 [link] [comments]
    RL courses online
    What courses are really useful to understand the theory of RL? Stanford? Berkeley? submitted by /u/MomoSolar [link] [comments]
    Q-LP
    Are there any sources for the Linear Programming solution to the Q-function? submitted by /u/MomoSolar [link] [comments]
    RL books - references
    Are there references that explain algorithms in modern RL like TRPO, PPO, A3C, in a clear way, and possibly implements them? submitted by /u/MomoSolar [link] [comments]
    DDPG tutorial
    Is there a site that explains DDPG theory and has an implementation for it in a very clear way? submitted by /u/MomoSolar [link] [comments]
    Policy Gradient for Continous actions
    What if my action space is continuous, but is within a certain (interval) range? submitted by /u/MomoSolar [link] [comments]
    Q-learning for demand response
    In the paper (in the first comment), optimal electricity prices are determined via Q-learning. The MDP includes energy demand as states and electricity prices as actions. The reward is a weighted sum of service provider profit and customer satisfaction. In this case, state t=2 need not be dependent on state t=1 and action t = 1. I do not understand how Q(st, at) can then represent the discounted sum of expected rewards, especially that st+1 may not follow from an action taken at st. Is the modelling of the MDP valid? https://pdf.sciencedirectassets.com/271429/1-s2.0-S0306261918X00099/1-s2.0-S0306261918304112/main.pdf?X-Amz-Security-Token=IQoJb3JpZ2luX2VjELr%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJHMEUCIGsBT5yR8u2kFHHVNsJMX4FAkc%2FB%2BuT0Elulb6gCmnntAiEAi2OBcPqvWhhGAQvYRKCCkc6dBRB4…
    AI MarketPlace to buy and sell ML models
    Hi, Im working on creating an AI marketplace where developers can upload models and startups, and enterprises can deploy and run them in the cloud at scale. Any feedback would be greatly appreciated! We are currently onboarding developers and waitlisting buyers. Here is our interest form: https://forms.gle/X4Wy7NyMcWULddEBA submitted by /u/Dismal-Call2668 [link] [comments]
  • Open

    Your Data-to-Value Journey Starts with AI and Data Literacy
    79.8% of organizations cite cultural barriers to data adoption, yet AI and data literacy rank at only 1.6% in the CDO’s list of priorities. I find the research from New Vantage Partners, headed by industry legends Tom Davenport and Randy Bean, incredibly valuable.  Their annual “Data and Analytics Leadership Annual Executive Survey” series delivers invaluable… Read More »Your Data-to-Value Journey Starts with AI and Data Literacy The post Your Data-to-Value Journey Starts with AI and Data Literacy appeared first on Data Science Central.  ( 23 min )
  • Open

    How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. (arXiv:2305.00586v5 [cs.CL] UPDATED)
    Pre-trained language models can be surprisingly adept at tasks they were not explicitly trained on, but how they implement these capabilities is poorly understood. In this paper, we investigate the basic mathematical abilities often acquired by pre-trained language models. Concretely, we use mechanistic interpretability techniques to explain the (limited) mathematical abilities of GPT-2 small. As a case study, we examine its ability to take in sentences such as "The war lasted from the year 1732 to the year 17", and predict valid two-digit end years (years > 32). We first identify a circuit, a small subset of GPT-2 small's computational graph that computes this task's output. Then, we explain the role of each circuit component, showing that GPT-2 small's final multi-layer perceptrons boost the probability of end years greater than the start year. Finally, we find related tasks that activate our circuit. Our results suggest that GPT-2 small computes greater-than using a complex but general mechanism that activates across diverse contexts.
    Generalized Bayesian Inference for Scientific Simulators via Amortized Cost Estimation. (arXiv:2305.15208v2 [stat.ML] UPDATED)
    Simulation-based inference (SBI) enables amortized Bayesian inference for simulators with implicit likelihoods. But when we are primarily interested in the quality of predictive simulations, or when the model cannot exactly reproduce the observed data (i.e., is misspecified), targeting the Bayesian posterior may be overly restrictive. Generalized Bayesian Inference (GBI) aims to robustify inference for (misspecified) simulator models, replacing the likelihood-function with a cost function that evaluates the goodness of parameters relative to data. However, GBI methods generally require running multiple simulations to estimate the cost function at each parameter value during inference, making the approach computationally infeasible for even moderately complex simulators. Here, we propose amortized cost estimation (ACE) for GBI to address this challenge: We train a neural network to approximate the cost function, which we define as the expected distance between simulations produced by a parameter and observed data. The trained network can then be used with MCMC to infer GBI posteriors for any observation without running additional simulations. We show that, on several benchmark tasks, ACE accurately predicts cost and provides predictive simulations that are closer to synthetic observations than other SBI methods, especially for misspecified simulators. Finally, we apply ACE to infer parameters of the Hodgkin-Huxley model given real intracellular recordings from the Allen Cell Types Database. ACE identifies better data-matching parameters while being an order of magnitude more simulation-efficient than a standard SBI method. In summary, ACE combines the strengths of SBI methods and GBI to perform robust and simulation-amortized inference for scientific simulators.
    Bayesian Design Principles for Frequentist Sequential Learning. (arXiv:2310.00806v2 [cs.LG] UPDATED)
    We develop a general theory to optimize the frequentist regret for sequential learning problems, where efficient bandit and reinforcement learning algorithms can be derived from unified Bayesian principles. We propose a novel optimization approach to generate "algorithmic beliefs" at each round, and use Bayesian posteriors to make decisions. The optimization objective to create "algorithmic beliefs," which we term "Algorithmic Information Ratio," represents an intrinsic complexity measure that effectively characterizes the frequentist regret of any algorithm. To the best of our knowledge, this is the first systematical approach to make Bayesian-type algorithms prior-free and applicable to adversarial settings, in a generic and optimal manner. Moreover, the algorithms are simple and often efficient to implement. As a major application, we present a novel algorithm for multi-armed bandits that achieves the "best-of-all-worlds" empirical performance in the stochastic, adversarial, and non-stationary environments. And we illustrate how these principles can be used in linear bandits, bandit convex optimization, and reinforcement learning.
    Exclusive Group Lasso for Structured Variable Selection. (arXiv:2108.10284v2 [cs.LG] UPDATED)
    A structured variable selection problem is considered in which the covariates, divided into predefined groups, activate according to sparse patterns with few nonzero entries per group. Capitalizing on the concept of atomic norm, a composite norm can be properly designed to promote such exclusive group sparsity patterns. The resulting norm lends itself to efficient and flexible regularized optimization algorithms for support recovery, like the proximal algorithm. Moreover, an active set algorithm is proposed that builds the solution by successively including structure atoms into the estimated support. It is also shown that such an algorithm can be tailored to match more rigid structures than plain exclusive group sparsity. Asymptotic consistency analysis (with both the number of parameters as well as the number of groups growing with the observation size) establishes the effectiveness of the proposed solution in terms of signed support recovery under conventional assumptions. Finally, a set of numerical simulations further corroborates the results.
    Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features. (arXiv:2311.00489v2 [cs.SD] UPDATED)
    While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is dissatisfactory how little is understood about what exactly is responsible for these results. Part of the success has been attributed in prior work to their capability to model supra-segmental temporal information (SST), i.e., learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means to force respective nets to focus more on SST and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing explainability of deep learning for speech technologies.
    Extremal Domain Translation with Neural Optimal Transport. (arXiv:2301.12874v3 [cs.LG] UPDATED)
    In many unpaired image domain translation problems, e.g., style transfer or super-resolution, it is important to keep the translated image similar to its respective input image. We propose the extremal transport (ET) which is a mathematical formalization of the theoretically best possible unpaired translation between a pair of domains w.r.t. the given similarity function. Inspired by the recent advances in neural optimal transport (OT), we propose a scalable algorithm to approximate ET maps as a limit of partial OT maps. We test our algorithm on toy examples and on the unpaired image-to-image translation task. The code is publicly available at https://github.com/milenagazdieva/ExtremalNeuralOptimalTransport
    Private Graph Extraction via Feature Explanations. (arXiv:2206.14724v2 [cs.LG] UPDATED)
    Privacy and interpretability are two important ingredients for achieving trustworthy machine learning. We study the interplay of these two aspects in graph machine learning through graph reconstruction attacks. The goal of the adversary here is to reconstruct the graph structure of the training data given access to model explanations. Based on the different kinds of auxiliary information available to the adversary, we propose several graph reconstruction attacks. We show that additional knowledge of post-hoc feature explanations substantially increases the success rate of these attacks. Further, we investigate in detail the differences between attack performance with respect to three different classes of explanation methods for graph neural networks: gradient-based, perturbation-based, and surrogate model-based methods. While gradient-based explanations reveal the most in terms of the graph structure, we find that these explanations do not always score high in utility. For the other two classes of explanations, privacy leakage increases with an increase in explanation utility. Finally, we propose a defense based on a randomized response mechanism for releasing the explanations, which substantially reduces the attack success rate. Our code is available at https://github.com/iyempissy/graph-stealing-attacks-with-explanation
    Fedstellar: A Platform for Decentralized Federated Learning. (arXiv:2306.09750v2 [cs.LG] UPDATED)
    In 2016, Google proposed Federated Learning (FL) as a novel paradigm to train Machine Learning (ML) models across the participants of a federation while preserving data privacy. Since its birth, Centralized FL (CFL) has been the most used approach, where a central entity aggregates participants' models to create a global one. However, CFL presents limitations such as communication bottlenecks, single point of failure, and reliance on a central server. Decentralized Federated Learning (DFL) addresses these issues by enabling decentralized model aggregation and minimizing dependency on a central entity. Despite these advances, current platforms training DFL models struggle with key issues such as managing heterogeneous federation network topologies. To overcome these challenges, this paper presents Fedstellar, a novel platform designed to train FL models in a decentralized, semi-decentralized, and centralized fashion across diverse federations of physical or virtualized devices. The Fedstellar implementation encompasses a web application with an interactive graphical interface, a controller for deploying federations of nodes using physical or virtual devices, and a core deployed on each device which provides the logic needed to train, aggregate, and communicate in the network. The effectiveness of the platform has been demonstrated in two scenarios: a physical deployment involving single-board devices such as Raspberry Pis for detecting cyberattacks, and a virtualized deployment comparing various FL approaches in a controlled environment using MNIST and CIFAR-10 datasets. In both scenarios, Fedstellar demonstrated consistent performance and adaptability, achieving F1 scores of 91%, 98%, and 91.2% using DFL for detecting cyberattacks and classifying MNIST and CIFAR-10, respectively, reducing training time by 32% compared to centralized approaches.
    Castor: Causal Temporal Regime Structure Learning. (arXiv:2311.01412v1 [cs.LG])
    The task of uncovering causal relationships among multivariate time series data stands as an essential and challenging objective that cuts across a broad array of disciplines ranging from climate science to healthcare. Such data entails linear or non-linear relationships, and usually follow multiple a priori unknown regimes. Existing causal discovery methods can infer summary causal graphs from heterogeneous data with known regimes, but they fall short in comprehensively learning both regimes and the corresponding causal graph. In this paper, we introduce CASTOR, a novel framework designed to learn causal relationships in heterogeneous time series data composed of various regimes, each governed by a distinct causal graph. Through the maximization of a score function via the EM algorithm, CASTOR infers the number of regimes and learns linear or non-linear causal relationships in each regime. We demonstrate the robust convergence properties of CASTOR, specifically highlighting its proficiency in accurately identifying unique regimes. Empirical evidence, garnered from exhaustive synthetic experiments and two real-world benchmarks, confirm CASTOR's superior performance in causal discovery compared to baseline methods. By learning a full temporal causal graph for each regime, CASTOR establishes itself as a distinctly interpretable method for causal discovery in heterogeneous time series.
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v6 [eess.IV] UPDATED)
    This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional "content" latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining "texture" variables characterizing the diffusion process are synthesized at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving multiple datasets and image quality assessment metrics show that our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics. Furthermore, training the diffusion with X-parameterization enables high-quality reconstructions in only a handful of decoding steps, greatly affecting the model's practicality.
    SRN-SZ: Deep Leaning-Based Scientific Error-bounded Lossy Compression with Super-resolution Neural Networks. (arXiv:2309.04037v2 [cs.LG] UPDATED)
    The fast growth of computational power and scales of modern super-computing systems have raised great challenges for the management of exascale scientific data. To maintain the usability of scientific data, error-bound lossy compression is proposed and developed as an essential technique for the size reduction of scientific data with constrained data distortion. Among the diverse datasets generated by various scientific simulations, certain datasets cannot be effectively compressed by existing error-bounded lossy compressors with traditional techniques. The recent success of Artificial Intelligence has inspired several researchers to integrate neural networks into error-bounded lossy compressors. However, those works still suffer from limited compression ratios and/or extremely low efficiencies. To address those issues and improve the compression on the hard-to-compress datasets, in this paper, we propose SRN-SZ, which is a deep learning-based scientific error-bounded lossy compressor leveraging the hierarchical data grid expansion paradigm implemented by super-resolution neural networks. SRN-SZ applies the most advanced super-resolution network HAT for its compression, which is free of time-costing per-data training. In experiments compared with various state-of-the-art compressors, SRN-SZ achieves up to 75% compression ratio improvements under the same error bound and up to 80% compression ratio improvements under the same PSNR than the second-best compressor.
    TRIALSCOPE A Unifying Causal Framework for Scaling Real-World Evidence Generation with Biomedical Language Models. (arXiv:2311.01301v1 [cs.LG])
    The rapid digitization of real-world data offers an unprecedented opportunity for optimizing healthcare delivery and accelerating biomedical discovery. In practice, however, such data is most abundantly available in unstructured forms, such as clinical notes in electronic medical records (EMRs), and it is generally plagued by confounders. In this paper, we present TRIALSCOPE, a unifying framework for distilling real-world evidence from population-level observational data. TRIALSCOPE leverages biomedical language models to structure clinical text at scale, employs advanced probabilistic modeling for denoising and imputation, and incorporates state-of-the-art causal inference techniques to combat common confounders. Using clinical trial specification as generic representation, TRIALSCOPE provides a turn-key solution to generate and reason with clinical hypotheses using observational data. In extensive experiments and analyses on a large-scale real-world dataset with over one million cancer patients from a large US healthcare network, we show that TRIALSCOPE can produce high-quality structuring of real-world data and generates comparable results to marquee cancer trials. In addition to facilitating in-silicon clinical trial design and optimization, TRIALSCOPE may be used to empower synthetic controls, pragmatic trials, post-market surveillance, as well as support fine-grained patient-like-me reasoning in precision diagnosis and treatment.
    The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing. (arXiv:2311.01410v1 [cs.CV])
    We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-Leibler divergence between the marginal distributions of the two SDEs gradually decreases while that for the ODEs remains as the time approaches zero, which shows the promise of SDE in image editing. Inspired by it, we provide the SDE counterparts for widely used ODE baselines in various tasks including inpainting and image-to-image translation, where SDE shows a consistent and substantial improvement. Moreover, we propose SDE-Drag -- a simple yet effective method built upon the SDE formulation for point-based content dragging. We build a challenging benchmark (termed DragBench) with open-set natural, art, and AI-generated images for evaluation. A user study on DragBench indicates that SDE-Drag significantly outperforms our ODE baseline, existing diffusion-based methods, and the renowned DragGAN. Our results demonstrate the superiority and versatility of SDE in image editing and push the boundary of diffusion-based editing methods.
    QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models. (arXiv:2310.09259v2 [cs.LG] UPDATED)
    Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios, such as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations should be quantized. We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit, while keeping some outlier weights and activations in higher-precision. The key feature of our scheme is that it is designed with computational efficiency in mind: we provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x relative to FP16 execution. We provide detailed studies for models from the OPT, LLaMA-2 and Falcon families, as well as a first instance of accurate inference using quantization plus 2:4 sparsity. Code is available at: https://github.com/IST-DASLab/QUIK.
    Description and Discussion on DCASE 2023 Challenge Task 2: First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring. (arXiv:2305.07828v2 [cs.SD] UPDATED)
    We present the task description of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge Task 2: ``First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring''. The main goal is to enable rapid deployment of ASD systems for new kinds of machines without the need for hyperparameter tuning. In the past ASD tasks, developed methods tuned hyperparameters for each machine type, as the development and evaluation datasets had the same machine types. However, collecting normal and anomalous data as the development dataset can be infeasible in practice. In 2023 Task 2, we focus on solving the first-shot problem, which is the challenge of training a model on a completely novel machine type. Specifically, (i) each machine type has only one section (a subset of machine type) and (ii) machine types in the development and evaluation datasets are completely different. Analysis of 86 submissions from 23 teams revealed that the keys to outperform baselines were: 1) sampling techniques for dealing with class imbalances across different domains and attributes, 2) generation of synthetic samples for robust detection, and 3) use of multiple large pre-trained models to extract meaningful embeddings for the anomaly detector.
    MiliPoint: A Point Cloud Dataset for mmWave Radar. (arXiv:2309.13425v2 [cs.LG] UPDATED)
    Millimetre-wave (mmWave) radar has emerged as an attractive and cost-effective alternative for human activity sensing compared to traditional camera-based systems. mmWave radars are also non-intrusive, providing better protection for user privacy. However, as a Radio Frequency (RF) based technology, mmWave radars rely on capturing reflected signals from objects, making them more prone to noise compared to cameras. This raises an intriguing question for the deep learning community: Can we develop more effective point set-based deep learning methods for such attractive sensors? To answer this question, our work, termed MiliPoint, delves into this idea by providing a large-scale, open dataset for the community to explore how mmWave radars can be utilised for human activity recognition. Moreover, MiliPoint stands out as it is larger in size than existing datasets, has more diverse human actions represented, and encompasses all three key tasks in human activity recognition. We have also established a range of point-based deep neural networks such as DGCNN, PointNet++ and PointTransformer, on MiliPoint, which can serve to set the ground baseline for further development.
    Bridging Machine Learning and Sciences: Opportunities and Challenges. (arXiv:2210.13441v2 [stat.ML] UPDATED)
    The application of machine learning in sciences has seen exciting advances in recent years. As a widely applicable technique, anomaly detection has been long studied in the machine learning community. Especially, deep neural nets-based out-of-distribution detection has made great progress for high-dimensional data. Recently, these techniques have been showing their potential in scientific disciplines. We take a critical look at their applicative prospects including data universality, experimental protocols, model robustness, etc. We discuss examples that display transferable practices and domain-specific challenges simultaneously, providing a starting point for establishing a novel interdisciplinary research paradigm in the near future.
    Anonymous Learning via Look-Alike Clustering: A Precise Analysis of Model Generalization. (arXiv:2310.04015v3 [cs.LG] UPDATED)
    While personalized recommendations systems have become increasingly popular, ensuring user data protection remains a top concern in the development of these learning systems. A common approach to enhancing privacy involves training models using anonymous data rather than individual data. In this paper, we explore a natural technique called \emph{look-alike clustering}, which involves replacing sensitive features of individuals with the cluster's average values. We provide a precise analysis of how training models using anonymous cluster centers affects their generalization capabilities. We focus on an asymptotic regime where the size of the training set grows in proportion to the features dimension. Our analysis is based on the Convex Gaussian Minimax Theorem (CGMT) and allows us to theoretically understand the role of different model components on the generalization error. In addition, we demonstrate that in certain high-dimensional regimes, training over anonymous cluster centers acts as a regularization and improves generalization error of the trained models. Finally, we corroborate our asymptotic theory with finite-sample numerical experiments where we observe a perfect match when the sample size is only of order of a few hundreds.
    Ranking with Popularity Bias: User Welfare under Self-Amplification Dynamics. (arXiv:2305.18333v2 [cs.IR] UPDATED)
    While popularity bias is recognized to play a crucial role in recommmender (and other ranking-based) systems, detailed analysis of its impact on collective user welfare has largely been lacking. We propose and theoretically analyze a general mechanism, rooted in many of the models proposed in the literature, by which item popularity, item quality, and position bias jointly impact user choice. We focus on a standard setting in which user utility is largely driven by item quality, and a recommender attempts to estimate it given user behavior. Formulating the problem as a non-stationary contextual bandit, we study the ability of a recommender policy to maximize user welfare under this model. We highlight the importance of exploration, not to eliminate popularity bias, but to mitigate its negative impact on welfare. We first show that naive popularity-biased recommenders induce linear regret by conflating item quality and popularity. More generally, we show that, even in linear settings, identifiability of item quality may not be possible due to the confounding effects of popularity bias. However, under sufficient variability assumptions, we develop an efficient optimistic algorithm and prove efficient regret guarantees w.r.t. user welfare. We complement our analysis with several simulation studies, which demonstrate the negative impact of popularity bias on the performance of several natural recommender policies.
    Combining Optimal Path Search With Task-Dependent Learning in a Neural Network. (arXiv:2201.11104v6 [cs.LG] UPDATED)
    Finding optimal paths in connected graphs requires determining the smallest total cost for traveling along the graph's edges. This problem can be solved by several classical algorithms where, usually, costs are predefined for all edges. Conventional planning methods can, thus, normally not be used when wanting to change costs in an adaptive way following the requirements of some task. Here we show that one can define a neural network representation of path finding problems by transforming cost values into synaptic weights, which allows for online weight adaptation using network learning mechanisms. When starting with an initial activity value of one, activity propagation in this network will lead to solutions, which are identical to those found by the Bellman-Ford algorithm. The neural network has the same algorithmic complexity as Bellman-Ford and, in addition, we can show that network learning mechanisms (such as Hebbian learning) can adapt the weights in the network augmenting the resulting paths according to some task at hand. We demonstrate this by learning to navigate in an environment with obstacles as well as by learning to follow certain sequences of path nodes. Hence, the here-presented novel algorithm may open up a different regime of applications where path-augmentation (by learning) is directly coupled with path finding in a natural way.
    Fine-grained Expressivity of Graph Neural Networks. (arXiv:2306.03698v2 [cs.LG] UPDATED)
    Numerous recent works have analyzed the expressive power of message-passing graph neural networks (MPNNs), primarily utilizing combinatorial techniques such as the $1$-dimensional Weisfeiler-Leman test ($1$-WL) for the graph isomorphism problem. However, the graph isomorphism objective is inherently binary, not giving insights into the degree of similarity between two given graphs. This work resolves this issue by considering continuous extensions of both $1$-WL and MPNNs to graphons. Concretely, we show that the continuous variant of $1$-WL delivers an accurate topological characterization of the expressive power of MPNNs on graphons, revealing which graphs these networks can distinguish and the level of difficulty in separating them. We identify the finest topology where MPNNs separate points and prove a universal approximation theorem. Consequently, we provide a theoretical framework for graph and graphon similarity combining various topological variants of classical characterizations of the $1$-WL. In particular, we characterize the expressive power of MPNNs in terms of the tree distance, which is a graph distance based on the concept of fractional isomorphisms, and substructure counts via tree homomorphisms, showing that these concepts have the same expressive power as the $1$-WL and MPNNs on graphons. Empirically, we validate our theoretical findings by showing that randomly initialized MPNNs, without training, exhibit competitive performance compared to their trained counterparts. Moreover, we evaluate different MPNN architectures based on their ability to preserve graph distances, highlighting the significance of our continuous $1$-WL test in understanding MPNNs' expressivity.
    A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories. (arXiv:2311.01329v1 [cs.LG])
    Offline imitation from observations aims to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available. Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable. The state-of-the-art "DIstribution Correction Estimation" (DICE) methods minimize divergence of state occupancy between expert and learner policies and retrieve a policy with weighted behavior cloning; however, their results are unstable when learning from incomplete trajectories, due to a non-robust optimization in the dual domain. To address the issue, in this paper, we propose Trajectory-Aware Imitation Learning from Observations (TAILO). TAILO uses a discounted sum along the future trajectory as the weight for weighted behavior cloning. The terms for the sum are scaled by the output of a discriminator, which aims to identify expert states. Despite simplicity, TAILO works well if there exist trajectories or segments of expert behavior in the task-agnostic data, a common assumption in prior work. In experiments across multiple testbeds, we find TAILO to be more robust and effective, particularly with incomplete trajectories.
    Model-free Policy Learning with Reward Gradients. (arXiv:2103.05147v4 [cs.LG] UPDATED)
    Despite the increasing popularity of policy gradient methods, they are yet to be widely utilized in sample-scarce applications, such as robotics. The sample efficiency could be improved by making best usage of available information. As a key component in reinforcement learning, the reward function is usually devised carefully to guide the agent. Hence, the reward function is usually known, allowing access to not only scalar reward signals but also reward gradients. To benefit from reward gradients, previous works require the knowledge of environment dynamics, which are hard to obtain. In this work, we develop the \textit{Reward Policy Gradient} estimator, a novel approach that integrates reward gradients without learning a model. Bypassing the model dynamics allows our estimator to achieve a better bias-variance trade-off, which results in a higher sample efficiency, as shown in the empirical analysis. Our method also boosts the performance of Proximal Policy Optimization on different MuJoCo control tasks.
    On Learning Gaussian Multi-index Models with Gradient Flow. (arXiv:2310.19793v2 [stat.ML] UPDATED)
    We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear projection and an arbitrary unknown, low-dimensional link function. As such, they constitute a natural template for feature learning in neural networks. We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection. By appropriately exploiting the matrix semigroup structure arising over the subspace correlation matrices, we establish global convergence of the resulting Grassmannian population gradient flow dynamics, and provide a quantitative description of its associated `saddle-to-saddle' dynamics. Notably, the timescales associated with each saddle can be explicitly characterized in terms of an appropriate Hermite decomposition of the target link function. In contrast with these positive results, we also show that the related \emph{planted} problem, where the link function is known and fixed, in fact has a rough optimization landscape, in which gradient flow dynamics might get trapped with high probability.
    Adv3D: Generating Safety-Critical 3D Objects through Closed-Loop Simulation. (arXiv:2311.01446v1 [cs.RO])
    Self-driving vehicles (SDVs) must be rigorously tested on a wide range of scenarios to ensure safe deployment. The industry typically relies on closed-loop simulation to evaluate how the SDV interacts on a corpus of synthetic and real scenarios and verify it performs properly. However, they primarily only test the system's motion planning module, and only consider behavior variations. It is key to evaluate the full autonomy system in closed-loop, and to understand how variations in sensor data based on scene appearance, such as the shape of actors, affect system performance. In this paper, we propose a framework, Adv3D, that takes real world scenarios and performs closed-loop sensor simulation to evaluate autonomy performance, and finds vehicle shapes that make the scenario more challenging, resulting in autonomy failures and uncomfortable SDV maneuvers. Unlike prior works that add contrived adversarial shapes to vehicle roof-tops or roadside to harm perception only, we optimize a low-dimensional shape representation to modify the vehicle shape itself in a realistic manner to degrade autonomy performance (e.g., perception, prediction, and motion planning). Moreover, we find that the shape variations found with Adv3D optimized in closed-loop are much more effective than those in open-loop, demonstrating the importance of finding scene appearance variations that affect autonomy in the interactive setting.
    Computable Phenotypes of Patient Acuity in the Intensive Care Unit. (arXiv:2005.05163v2 [q-bio.QM] UPDATED)
    Continuous monitoring and patient acuity assessments are key aspects of Intensive Care Unit (ICU) practice, but both are limited by time constraints imposed on healthcare providers. Moreover, anticipating clinical trajectories remains imprecise. The objectives of this study are to (1) develop an electronic phenotype of acuity using automated variable retrieval within the electronic health records and (2) describe transitions between acuity states that illustrate the clinical trajectories of ICU patients. We gathered two single-center, longitudinal electronic health record datasets for 51,372 adult ICU patients admitted to the University of Florida Health (UFH) Gainesville (GNV) and Jacksonville (JAX). We developed algorithms to quantify acuity status at four-hour intervals for each ICU admission and identify acuity phenotypes using continuous acuity status and k-means clustering approach. 51,073 admissions for 38,749 patients in the UFH GNV dataset and 22,219 admissions for 12,623 patients in the UFH JAX dataset had at least one ICU stay lasting more than four hours. There were three phenotypes: persistently stable, persistently unstable, and transitioning from unstable to stable. For stable patients, approximately 0.7%-1.7% would transition to unstable, 0.02%-0.1% would expire, 1.2%-3.4% would be discharged, and the remaining 96%-97% would remain stable in the ICU every four hours. For unstable patients, approximately 6%-10% would transition to stable, 0.4%-0.5% would expire, and the remaining 89%-93% would remain unstable in the ICU in the next four hours. We developed phenotyping algorithms for patient acuity status every four hours while admitted to the ICU. This approach may be useful in developing prognostic and clinical decision-support tools to aid patients, caregivers, and providers in shared decision-making processes regarding escalation of care and patient values.
    EVBattery: A Large-Scale Electric Vehicle Dataset for Battery Health and Capacity Estimation. (arXiv:2201.12358v3 [cs.LG] UPDATED)
    Electric vehicles (EVs) play an important role in reducing carbon emissions. As EV adoption accelerates, safety issues caused by EV batteries have become an important research topic. In order to benchmark and develop data-driven methods for this task, we introduce a large and comprehensive dataset of EV batteries. Our dataset includes charging records collected from hundreds of EVs from three manufacturers over several years. Our dataset is the first large-scale public dataset on real-world battery data, as existing data either include only several vehicles or is collected in the lab environment. Meanwhile, our dataset features two types of labels, corresponding to two key tasks - battery health estimation and battery capacity estimation. In addition to demonstrating how existing deep learning algorithms can be applied to this task, we further develop an algorithm that exploits the data structure of battery systems. Our algorithm achieves better results and shows that a customized method can improve model performances. We hope that this public dataset provides valuable resources for researchers, policymakers, and industry professionals to better understand the dynamics of EV battery aging and support the transition toward a sustainable transportation system.
    High-dimensional Linear Bandits with Knapsacks. (arXiv:2311.01327v1 [cs.LG])
    We study the contextual bandits with knapsack (CBwK) problem under the high-dimensional setting where the dimension of the feature is large. The reward of pulling each arm equals the multiplication of a sparse high-dimensional weight vector and the feature of the current arrival, with additional random noise. In this paper, we investigate how to exploit this sparsity structure to achieve improved regret for the CBwK problem. To this end, we first develop an online variant of the hard thresholding algorithm that performs the sparse estimation in an online manner. We further combine our online estimator with a primal-dual framework, where we assign a dual variable to each knapsack constraint and utilize an online learning algorithm to update the dual variable, thereby controlling the consumption of the knapsack capacity. We show that this integrated approach allows us to achieve a sublinear regret that depends logarithmically on the feature dimension, thus improving the polynomial dependency established in the previous literature. We also apply our framework to the high-dimension contextual bandit problem without the knapsack constraint and achieve optimal regret in both the data-poor regime and the data-rich regime. We finally conduct numerical experiments to show the efficient empirical performance of our algorithms under the high dimensional setting.
    Collaborative Learning via Prediction Consensus. (arXiv:2305.18497v2 [cs.LG] UPDATED)
    We consider a collaborative learning setting where the goal of each agent is to improve their own model by leveraging the expertise of collaborators, in addition to their own training data. To facilitate the exchange of expertise among agents, we propose a distillation-based method leveraging shared unlabeled auxiliary data, which is pseudo-labeled by the collective. Central to our method is a trust weighting scheme that serves to adaptively weigh the influence of each collaborator on the pseudo-labels until a consensus on how to label the auxiliary data is reached. We demonstrate empirically that our collaboration scheme is able to significantly boost individual models' performance in the target domain from which the auxiliary data is sampled. At the same time, it can provably mitigate the negative impact of bad models on the collective. By design, our method adeptly accommodates heterogeneity in model architectures and substantially reduces communication overhead compared to typical collaborative learning methods.
    Causal Falsification of Digital Twins. (arXiv:2301.07210v4 [stat.ME] UPDATED)
    Digital twins are virtual systems designed to predict how a real-world process will evolve in response to interventions. This modelling paradigm holds substantial promise in many applications, but rigorous procedures for assessing their accuracy are essential for safety-critical settings. We consider how to assess the accuracy of a digital twin using real-world data. We formulate this as causal inference problem, which leads to a precise definition of what it means for a twin to be "correct" appropriate for many applications. Unfortunately, fundamental results from causal inference mean observational data cannot be used to certify that a twin is correct in this sense unless potentially tenuous assumptions are made, such as that the data are unconfounded. To avoid these assumptions, we propose instead to find situations in which the twin is not correct, and present a general-purpose statistical procedure for doing so. Our approach yields reliable and actionable information about the twin under only the assumption of an i.i.d. dataset of observational trajectories, and remains sound even if the data are confounded. We apply our methodology to a large-scale, real-world case study involving sepsis modelling within the Pulse Physiology Engine, which we assess using the MIMIC-III dataset of ICU patients.
    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. (arXiv:2307.05973v2 [cs.RO] UPDATED)
    Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io
    Targeted Separation and Convergence with Kernel Discrepancies. (arXiv:2209.12835v2 [stat.ML] UPDATED)
    Maximum mean discrepancies (MMDs) like the kernel Stein discrepancy (KSD) have grown central to a wide range of applications, including hypothesis testing, sampler selection, distribution approximation, and variational inference. In each setting, these kernel-based discrepancy measures are required to (i) separate a target P from other probability measures or even (ii) control weak convergence to P. In this article we derive new sufficient and necessary conditions to ensure (i) and (ii). For MMDs on separable metric spaces, we characterize those kernels that separate Bochner embeddable measures and introduce simple conditions for separating all measures with unbounded kernels and for controlling convergence with bounded kernels. We use these results on $\mathbb{R}^d$ to substantially broaden the known conditions for KSD separation and convergence control and to develop the first KSDs known to exactly metrize weak convergence to P. Along the way, we highlight the implications of our results for hypothesis testing, measuring and improving sample quality, and sampling with Stein variational gradient descent.
    (Amplified) Banded Matrix Factorization: A unified approach to private training. (arXiv:2306.08153v2 [cs.LG] UPDATED)
    Matrix factorization (MF) mechanisms for differential privacy (DP) have substantially improved the state-of-the-art in privacy-utility-computation tradeoffs for ML applications in a variety of scenarios, but in both the centralized and federated settings there remain instances where either MF cannot be easily applied, or other algorithms provide better tradeoffs (typically, as $\epsilon$ becomes small). In this work, we show how MF can subsume prior state-of-the-art algorithms in both federated and centralized training settings, across all privacy budgets. The key technique throughout is the construction of MF mechanisms with banded matrices (lower-triangular matrices with at most $\hat{b}$ nonzero bands including the main diagonal). For cross-device federated learning (FL), this enables multiple-participations with a relaxed device participation schema compatible with practical FL infrastructure (as demonstrated by a production deployment). In the centralized setting, we prove that banded matrices enjoy the same privacy amplification results as the ubiquitous DP-SGD algorithm, but can provide strictly better performance in most scenarios -- this lets us always at least match DP-SGD, and often outperform it.
    1D-CapsNet-LSTM: A Deep Learning-Based Model for Multi-Step Stock Index Forecasting. (arXiv:2310.02090v2 [cs.LG] UPDATED)
    Multi-step stock index forecasting is vital in finance for informed decision-making. Current forecasting methods on this task frequently produce unsatisfactory results due to the inherent data randomness and instability, thereby underscoring the demand for advanced forecasting models. Given the superiority of capsule network (CapsNet) over CNN in various forecasting and classification tasks, this study investigates the potential of integrating a 1D CapsNet with an LSTM network for multi-step stock index forecasting. To this end, a hybrid 1D-CapsNet-LSTM model is introduced, which utilizes a 1D CapsNet to generate high-level capsules from sequential data and a LSTM network to capture temporal dependencies. To maintain stochastic dependencies over different forecasting horizons, a multi-input multi-output (MIMO) strategy is employed. The model's performance is evaluated on real-world stock market indices, including S&P 500, DJIA, IXIC, and NYSE, and compared to baseline models, including LSTM, RNN, and CNN-LSTM, using metrics such as RMSE, MAE, MAPE, and TIC. The proposed 1D-CapsNet-LSTM model consistently outperforms baseline models in two key aspects. It exhibits significant reductions in forecasting errors compared to baseline models. Furthermore, it displays a slower rate of error increase with lengthening forecast horizons, indicating increased robustness for multi-step forecasting tasks.
    CADSim: Robust and Scalable in-the-wild 3D Reconstruction for Controllable Sensor Simulation. (arXiv:2311.01447v1 [cs.CV])
    Realistic simulation is key to enabling safe and scalable development of % self-driving vehicles. A core component is simulating the sensors so that the entire autonomy system can be tested in simulation. Sensor simulation involves modeling traffic participants, such as vehicles, with high quality appearance and articulated geometry, and rendering them in real time. The self-driving industry has typically employed artists to build these assets. However, this is expensive, slow, and may not reflect reality. Instead, reconstructing assets automatically from sensor data collected in the wild would provide a better path to generating a diverse and large set with good real-world coverage. Nevertheless, current reconstruction approaches struggle on in-the-wild sensor data, due to its sparsity and noise. To tackle these issues, we present CADSim, which combines part-aware object-class priors via a small set of CAD models with differentiable rendering to automatically reconstruct vehicle geometry, including articulated wheels, with high-quality appearance. Our experiments show our method recovers more accurate shapes from sparse data compared to existing approaches. Importantly, it also trains and renders efficiently. We demonstrate our reconstructed vehicles in several applications, including accurate testing of autonomy perception systems.
    Data Summarization beyond Monotonicity: Non-monotone Two-Stage Submodular Maximization. (arXiv:2309.05183v2 [cs.DS] UPDATED)
    The objective of a two-stage submodular maximization problem is to reduce the ground set using provided training functions that are submodular, with the aim of ensuring that optimizing new objective functions over the reduced ground set yields results comparable to those obtained over the original ground set. This problem has applications in various domains including data summarization. Existing studies often assume the monotonicity of the objective function, whereas our work pioneers the extension of this research to accommodate non-monotone submodular functions. We have introduced the first constant-factor approximation algorithms for this more general case.
    Long Story Short: Omitted Variable Bias in Causal Machine Learning. (arXiv:2112.13398v4 [econ.EM] UPDATED)
    We derive general, yet simple, sharp bounds on the size of the omitted variable bias for a broad class of causal parameters that can be identified as linear functionals of the conditional expectation function of the outcome. Such functionals encompass many of the traditional targets of investigation in causal inference studies, such as, for example, (weighted) average of potential outcomes, average treatment effects (including subgroup effects, such as the effect on the treated), (weighted) average derivatives, and policy effects from shifts in covariate distribution -- all for general, nonparametric causal models. Our construction relies on the Riesz-Frechet representation of the target functional. Specifically, we show how the bound on the bias depends only on the additional variation that the latent variables create both in the outcome and in the Riesz representer for the parameter of interest. Moreover, in many important cases (e.g, average treatment effects and avearage derivatives) the bound is shown to depend on easily interpretable quantities that measure the explanatory power of the omitted variables. Therefore, simple plausibility judgments on the maximum explanatory power of omitted variables (in explaining treatment and outcome variation) are sufficient to place overall bounds on the size of the bias. Furthermore, we use debiased machine learning to provide flexible and efficient statistical inference on learnable components of the bounds. Finally, empirical examples demonstrate the usefulness of the approach.
    ILCAS: Imitation Learning-Based Configuration-Adaptive Streaming for Live Video Analytics with Cross-Camera Collaboration. (arXiv:2308.10068v2 [cs.NI] UPDATED)
    The high-accuracy and resource-intensive deep neural networks (DNNs) have been widely adopted by live video analytics (VA), where camera videos are streamed over the network to resource-rich edge/cloud servers for DNN inference. Common video encoding configurations (e.g., resolution and frame rate) have been identified with significant impacts on striking the balance between bandwidth consumption and inference accuracy and therefore their adaption scheme has been a focus of optimization. However, previous profiling-based solutions suffer from high profiling cost, while existing deep reinforcement learning (DRL) based solutions may achieve poor performance due to the usage of fixed reward function for training the agent, which fails to craft the application goals in various scenarios. In this paper, we propose ILCAS, the first imitation learning (IL) based configuration-adaptive VA streaming system. Unlike DRL-based solutions, ILCAS trains the agent with demonstrations collected from the expert which is designed as an offline optimal policy that solves the configuration adaption problem through dynamic programming. To tackle the challenge of video content dynamics, ILCAS derives motion feature maps based on motion vectors which allow ILCAS to visually ``perceive'' video content changes. Moreover, ILCAS incorporates a cross-camera collaboration scheme to exploit the spatio-temporal correlations of cameras for more proper configuration selection. Extensive experiments confirm the superiority of ILCAS compared with state-of-the-art solutions, with 2-20.9% improvement of mean accuracy and 19.9-85.3% reduction of chunk upload lag.
    Like an Open Book? Read Neural Network Architecture with Simple Power Analysis on 32-bit Microcontrollers. (arXiv:2311.01344v1 [cs.CR])
    Model extraction is a growing concern for the security of AI systems. For deep neural network models, the architecture is the most important information an adversary aims to recover. Being a sequence of repeated computation blocks, neural network models deployed on edge-devices will generate distinctive side-channel leakages. The latter can be exploited to extract critical information when targeted platforms are physically accessible. By combining theoretical knowledge about deep learning practices and analysis of a widespread implementation library (ARM CMSIS-NN), our purpose is to answer this critical question: how far can we extract architecture information by simply examining an EM side-channel trace? For the first time, we propose an extraction methodology for traditional MLP and CNN models running on a high-end 32-bit microcontroller (Cortex-M7) that relies only on simple pattern recognition analysis. Despite few challenging cases, we claim that, contrary to parameters extraction, the complexity of the attack is relatively low and we highlight the urgent need for practicable protections that could fit the strong memory and latency requirements of such platforms.
    Add and Thin: Diffusion for Temporal Point Processes. (arXiv:2311.01139v1 [cs.LG])
    Autoregressive neural networks within the temporal point process (TPP) framework have become the standard for modeling continuous-time event data. Even though these models can expressively capture event sequences in a one-step-ahead fashion, they are inherently limited for long-term forecasting applications due to the accumulation of errors caused by their sequential nature. To overcome these limitations, we derive ADD-THIN, a principled probabilistic denoising diffusion model for TPPs that operates on entire event sequences. Unlike existing diffusion approaches, ADD-THIN naturally handles data with discrete and continuous components. In experiments on synthetic and real-world datasets, our model matches the state-of-the-art TPP models in density estimation and strongly outperforms them in forecasting.
    Online Continual Learning Without the Storage Constraint. (arXiv:2305.09253v2 [cs.CV] UPDATED)
    Traditional online continual learning (OCL) research has primarily focused on mitigating catastrophic forgetting with fixed and limited storage allocation throughout an agent's lifetime. However, a broad range of real-world applications are primarily constrained by computational costs rather than storage limitations. In this paper, we target such applications, investigating the online continual learning problem under relaxed storage constraints and limited computational budgets. We contribute a simple algorithm, which updates a kNN classifier continually along with a fixed, pretrained feature extractor. We selected this algorithm due to its exceptional suitability for online continual learning. It can adapt to rapidly changing streams, has zero stability gap, operates within tiny computational budgets, has low storage requirements by only storing features, and has a consistency property: It never forgets previously seen data. These attributes yield significant improvements, allowing our proposed algorithm to outperform existing methods by over 20% in accuracy on two large-scale OCL datasets: Continual LOCalization (CLOC) with 39M images and 712 classes and Continual Google Landmarks V2 (CGLM) with 580K images and 10,788 classes, even when existing methods retain all previously seen images. Furthermore, we achieve this superior performance with considerably reduced computational and storage expenses. We provide code to reproduce our results at github.com/drimpossible/ACM.
    Combating Bilateral Edge Noise for Robust Link Prediction. (arXiv:2311.01196v1 [cs.LG])
    Although link prediction on graphs has achieved great success with the development of graph neural networks (GNNs), the potential robustness under the edge noise is still less investigated. To close this gap, we first conduct an empirical study to disclose that the edge noise bilaterally perturbs both input topology and target label, yielding severe performance degradation and representation collapse. To address this dilemma, we propose an information-theory-guided principle, Robust Graph Information Bottleneck (RGIB), to extract reliable supervision signals and avoid representation collapse. Different from the basic information bottleneck, RGIB further decouples and balances the mutual dependence among graph topology, target labels, and representation, building new learning objectives for robust representation against the bilateral noise. Two instantiations, RGIB-SSL and RGIB-REP, are explored to leverage the merits of different methodologies, i.e., self-supervised learning and data reparameterization, for implicit and explicit data denoising, respectively. Extensive experiments on six datasets and three GNNs with diverse noisy scenarios verify the effectiveness of our RGIB instantiations. The code is publicly available at: https://github.com/tmlr-group/RGIB.
    The Behavior and Convergence of Local Bayesian Optimization. (arXiv:2305.15572v2 [cs.LG] UPDATED)
    A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by M\"uller et al. (2021), and derive convergence rates in both the noisy and noiseless settings.
    Sample-efficient Multi-objective Molecular Optimization with GFlowNets. (arXiv:2302.04040v2 [cs.LG] UPDATED)
    Many crucial scientific problems involve designing novel molecules with desired properties, which can be formulated as a black-box optimization problem over the discrete chemical space. In practice, multiple conflicting objectives and costly evaluations (e.g., wet-lab experiments) make the diversity of candidates paramount. Computational methods have achieved initial success but still struggle with considering diversity in both objective and search space. To fill this gap, we propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN) as an acquisition function optimizer, with the purpose of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front. Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives. We further propose a hindsight-like off-policy strategy to share high-performing molecules among different preferences in order to speed up learning for HN-GFN. We empirically illustrate that HN-GFN has adequate capacity to generalize over preferences. Moreover, experiments in various real-world MOBO settings demonstrate that our framework predominantly outperforms existing methods in terms of candidate quality and sample efficiency. The code is available at https://github.com/violet-sto/HN-GFN.
    EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models. (arXiv:2307.02028v2 [cs.LG] UPDATED)
    While the general machine learning (ML) community has benefited from public datasets, tasks, and models, the progress of ML in healthcare has been hampered by a lack of such shared assets. The success of foundation models creates new challenges for healthcare ML by requiring access to shared pretrained models to validate performance benefits. We help address these challenges through three contributions. First, we publish a new dataset, EHRSHOT, which contains deidentified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and not restricted to ICU/ED patients. Second, we publish the weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. We are one of the first to fully release such a model for coded EHR data; in contrast, most prior models released for clinical data (e.g. GatorTron, ClinicalBERT) only work with unstructured text and cannot process the rich, structured data within an EHR. We provide an end-to-end pipeline for the community to validate and build upon its performance. Third, we define 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaptation. Our model and dataset are available via a research data use agreement from the Stanford AIMI Center. Code to reproduce our results are available at our Github repo: https://github.com/som-shahlab/ehrshot-benchmark
    A deep learning experiment for semantic segmentation of overlapping characters in palimpsests. (arXiv:2311.01130v1 [cs.CV])
    Palimpsests refer to historical manuscripts where erased writings have been partially covered by the superimposition of a second writing. By employing imaging techniques, e.g., multispectral imaging, it becomes possible to identify features that are imperceptible to the naked eye, including faded and erased inks. When dealing with overlapping inks, Artificial Intelligence techniques can be utilized to disentangle complex nodes of overlapping letters. In this work, we propose deep learning-based semantic segmentation as a method for identifying and segmenting individual letters in overlapping characters. The experiment was conceived as a proof of concept, focusing on the palimpsests of the Ars Grammatica by Prisciano as a case study. Furthermore, caveats and prospects of our approach combined with multispectral imaging are also discussed.
    AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models. (arXiv:2311.01305v1 [cs.LG])
    Large language models(LLMs) exhibit excellent performance across a variety of tasks, but they come with significant computational and storage costs. Quantizing these models is an effective way to alleviate this issue. However, existing methods struggle to strike a balance between model accuracy and hardware efficiency. This is where we introduce AWEQ, a post-training method that requires no additional training overhead. AWEQ excels in both ultra-low-bit quantization and 8-bit weight and activation (W8A8) quantization. There is an observation that weight quantization is less challenging than activation quantization. AWEQ transfers the difficulty of activation quantization to weights using channel equalization, achieving a balance between the quantization difficulties of both, and thereby maximizing performance. We have further refined the equalization method to mitigate quantization bias error, ensuring the robustness of the model. Extensive experiments on popular models such as LLaMA and OPT demonstrate that AWEQ outperforms all existing post-training quantization methods for large models.
    Role of Structural and Conformational Diversity for Machine Learning Potentials. (arXiv:2311.00862v1 [physics.chem-ph])
    In the field of Machine Learning Interatomic Potentials (MLIPs), understanding the intricate relationship between data biases, specifically conformational and structural diversity, and model generalization is critical in improving the quality of Quantum Mechanics (QM) data generation efforts. We investigate these dynamics through two distinct experiments: a fixed budget one, where the dataset size remains constant, and a fixed molecular set one, which focuses on fixed structural diversity while varying conformational diversity. Our results reveal nuanced patterns in generalization metrics. Notably, for optimal structural and conformational generalization, a careful balance between structural and conformational diversity is required, but existing QM datasets do not meet that trade-off. Additionally, our results highlight the limitation of the MLIP models at generalizing beyond their training distribution, emphasizing the importance of defining applicability domain during model deployment. These findings provide valuable insights and guidelines for QM data generation efforts.
    Improving Robustness via Tilted Exponential Layer: A Communication-Theoretic Perspective. (arXiv:2311.01047v1 [cs.LG])
    State-of-the-art techniques for enhancing robustness of deep networks mostly rely on empirical risk minimization with suitable data augmentation. In this paper, we propose a complementary approach motivated by communication theory, aimed at enhancing the signal-to-noise ratio at the output of a neural network layer via neural competition during learning and inference. In addition to minimization of a standard end-to-end cost, neurons compete to sparsely represent layer inputs by maximization of a tilted exponential (TEXP) objective function for the layer. TEXP learning can be interpreted as maximum likelihood estimation of matched filters under a Gaussian model for data noise. Inference in a TEXP layer is accomplished by replacing batch norm by a tilted softmax, which can be interpreted as computation of posterior probabilities for the competing signaling hypotheses represented by each neuron. After providing insights via simplified models, we show, by experimentation on standard image datasets, that TEXP learning and inference enhances robustness against noise and other common corruptions, without requiring data augmentation. Further cumulative gains in robustness against this array of distortions can be obtained by appropriately combining TEXP with data augmentation techniques.
    Learning to Design and Use Tools for Robotic Manipulation. (arXiv:2311.00754v1 [cs.RO])
    When limited by their own morphologies, humans and some species of animals have the remarkable ability to use objects from the environment toward accomplishing otherwise impossible tasks. Robots might similarly unlock a range of additional capabilities through tool use. Recent techniques for jointly optimizing morphology and control via deep learning are effective at designing locomotion agents. But while outputting a single morphology makes sense for locomotion, manipulation involves a variety of strategies depending on the task goals at hand. A manipulation agent must be capable of rapidly prototyping specialized tools for different goals. Therefore, we propose learning a designer policy, rather than a single design. A designer policy is conditioned on task information and outputs a tool design that helps solve the task. A design-conditioned controller policy can then perform manipulation using these tools. In this work, we take a step towards this goal by introducing a reinforcement learning framework for jointly learning these policies. Through simulated manipulation tasks, we show that this framework is more sample efficient than prior methods in multi-goal or multi-variant settings, can perform zero-shot interpolation or fine-tuning to tackle previously unseen goals, and allows tradeoffs between the complexity of design and control policies under practical constraints. Finally, we deploy our learned policies onto a real robot. Please see our supplementary video and website at https://robotic-tool-design.github.io/ for visualizations.
    Exploring Deep Learning Techniques for Glaucoma Detection: A Comprehensive Review. (arXiv:2311.01425v1 [eess.IV])
    Glaucoma is one of the primary causes of vision loss around the world, necessitating accurate and efficient detection methods. Traditional manual detection approaches have limitations in terms of cost, time, and subjectivity. Recent developments in deep learning approaches demonstrate potential in automating glaucoma detection by detecting relevant features from retinal fundus images. This article provides a comprehensive overview of cutting-edge deep learning methods used for the segmentation, classification, and detection of glaucoma. By analyzing recent studies, the effectiveness and limitations of these techniques are evaluated, key findings are highlighted, and potential areas for further research are identified. The use of deep learning algorithms may significantly improve the efficacy, usefulness, and accuracy of glaucoma detection. The findings from this research contribute to the ongoing advancements in automated glaucoma detection and have implications for improving patient outcomes and reducing the global burden of glaucoma.
    Deep Transformed Gaussian Processes. (arXiv:2310.18230v2 [cs.LG] UPDATED)
    Transformed Gaussian Processes (TGPs) are stochastic processes specified by transforming samples from the joint distribution from a prior process (typically a GP) using an invertible transformation; increasing the flexibility of the base process. Furthermore, they achieve competitive results compared with Deep Gaussian Processes (DGPs), which are another generalization constructed by a hierarchical concatenation of GPs. In this work, we propose a generalization of TGPs named Deep Transformed Gaussian Processes (DTGPs), which follows the trend of concatenating layers of stochastic processes. More precisely, we obtain a multi-layer model in which each layer is a TGP. This generalization implies an increment of flexibility with respect to both TGPs and DGPs. Exact inference in such a model is intractable. However, we show that one can use variational inference to approximate the required computations yielding a straightforward extension of the popular DSVI inference algorithm Salimbeni et al (2017). The experiments conducted evaluate the proposed novel DTGPs in multiple regression datasets, achieving good scalability and performance.
    AutoMixer for Improved Multivariate Time-Series Forecasting on Business and IT Observability Data. (arXiv:2310.20280v2 [cs.LG] UPDATED)
    The efficiency of business processes relies on business key performance indicators (Biz-KPIs), that can be negatively impacted by IT failures. Business and IT Observability (BizITObs) data fuses both Biz-KPIs and IT event channels together as multivariate time series data. Forecasting Biz-KPIs in advance can enhance efficiency and revenue through proactive corrective measures. However, BizITObs data generally exhibit both useful and noisy inter-channel interactions between Biz-KPIs and IT events that need to be effectively decoupled. This leads to suboptimal forecasting performance when existing multivariate forecasting models are employed. To address this, we introduce AutoMixer, a time-series Foundation Model (FM) approach, grounded on the novel technique of channel-compressed pretrain and finetune workflows. AutoMixer leverages an AutoEncoder for channel-compressed pretraining and integrates it with the advanced TSMixer model for multivariate time series forecasting. This fusion greatly enhances the potency of TSMixer for accurate forecasts and also generalizes well across several downstream tasks. Through detailed experiments and dashboard analytics, we show AutoMixer's capability to consistently improve the Biz-KPI's forecasting accuracy (by 11-15\%) which directly translates to actionable business insights.
    Generalizing Nonlinear ICA Beyond Structural Sparsity. (arXiv:2311.00866v1 [cs.LG])
    Nonlinear independent component analysis (ICA) aims to uncover the true latent sources from their observable nonlinear mixtures. Despite its significance, the identifiability of nonlinear ICA is known to be impossible without additional assumptions. Recent advances have proposed conditions on the connective structure from sources to observed variables, known as Structural Sparsity, to achieve identifiability in an unsupervised manner. However, the sparsity constraint may not hold universally for all sources in practice. Furthermore, the assumptions of bijectivity of the mixing process and independence among all sources, which arise from the setting of ICA, may also be violated in many real-world scenarios. To address these limitations and generalize nonlinear ICA, we propose a set of new identifiability results in the general settings of undercompleteness, partial sparsity and source dependence, and flexible grouping structures. Specifically, we prove identifiability when there are more observed variables than sources (undercomplete), and when certain sparsity and/or source independence assumptions are not met for some changing sources. Moreover, we show that even in cases with flexible grouping structures (e.g., part of the sources can be divided into irreducible independent groups with various sizes), appropriate identifiability results can also be established. Theoretical claims are supported empirically on both synthetic and real-world datasets.
    Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models. (arXiv:2311.00871v1 [cs.LG])
    Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-context learning (ICL) -- to perform new tasks when prompted with unseen input-output examples without any explicit model training. In this work, we study how effectively transformers can bridge between their pretraining data mixture, comprised of multiple distinct task families, to identify and learn new tasks in-context which are both inside and outside the pretraining distribution. Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of $(x, f(x))$ pairs rather than natural language. Our empirical results show transformers demonstrate near-optimal unsupervised model selection capabilities, in their ability to first in-context identify different task families and in-context learn within them when the task families are well-represented in their pretraining data. However when presented with tasks or functions which are out-of-domain of their pretraining data, we demonstrate various failure modes of transformers and degradation of their generalization for even simple extrapolation tasks. Together our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than inductive biases that create fundamental generalization capabilities.
    COSTAR: Improved Temporal Counterfactual Estimation with Self-Supervised Learning. (arXiv:2311.00886v1 [cs.LG])
    Estimation of temporal counterfactual outcomes from observed history is crucial for decision-making in many domains such as healthcare and e-commerce, particularly when randomized controlled trials (RCTs) suffer from high cost or impracticality. For real-world datasets, modeling time-dependent confounders is challenging due to complex dynamics, long-range dependencies and both past treatments and covariates affecting the future outcomes. In this paper, we introduce COunterfactual Self-supervised TrAnsformeR (COSTAR), a novel approach that integrates self-supervised learning for improved historical representations. The proposed framework combines temporal and feature-wise attention with a component-wise contrastive loss tailored for temporal treatment outcome observations, yielding superior performance in estimation accuracy and generalization to out-of-distribution data compared to existing models, as validated by empirical results on both synthetic and real-world datasets.
    Deep learning-based Edge-aware pre and post-processing methods for JPEG compressed images. (arXiv:2104.04926v2 [eess.IV] UPDATED)
    We propose a learning-based compression scheme that envelopes a standard codec between pre and post-processing deep CNNs. Specifically, we demonstrate improvements over prior approaches utilizing a compression-decompression network by introducing: (a) an edge-aware loss function to prevent blurring that is commonly occurred in prior works & (b) a super-resolution convolutional neural network (CNN) for post-processing along with a corresponding pre-processing network for improved rate-distortion performance in the low rate regime. The algorithm is assessed on a variety of datasets varying from low to high resolution namely Set 5, Set 7, Classic 5, Set 14, Live 1, Kodak, General 100, CLIC 2019. When compared to JPEG, JPEG2000, BPG, and recent CNN approach, the proposed algorithm contributes significant improvement in PSNR with an approximate gain of 20.75%, 8.47%, 3.22%, 3.23% and 24.59%, 14.46%, 10.14%, 8.57% at low and high bit-rates respectively. Similarly, this improvement in MS-SSIM is approximately 71.43%, 50%, 36.36%, 23.08%, 64.70% and 64.47%, 61.29%, 47.06%, 51.52%, 16.28% at low and high bit-rates respectively. With CLIC 2019 dataset, PSNR is found to be superior with approximately 16.67%, 10.53%, 6.78%, and 24.62%, 17.39%, 14.08% at low and high bit-rates respectively, over JPEG2000, BPG, and recent CNN approach. Similarly, the MS-SSIM is found to be superior with approximately 72%, 45.45%, 39.13%, 18.52%, and 71.43%, 50%, 41.18%, 17.07% at low and high bit-rates respectively, compared to the same approaches. A similar type of improvement is achieved with other datasets also.
    Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy. (arXiv:2311.01002v1 [cs.LG])
    Data pruning, which aims to downsize a large training set into a small informative subset, is crucial for reducing the enormous computational costs of modern deep learning. Though large-scale data collections invariably contain annotation noise and numerous robust learning methods have been developed, data pruning for the noise-robust learning scenario has received little attention. With state-of-the-art Re-labeling methods that self-correct erroneous labels while training, it is challenging to identify which subset induces the most accurate re-labeling of erroneous labels in the entire training set. In this paper, we formalize the problem of data pruning with re-labeling. We first show that the likelihood of a training example being correctly re-labeled is proportional to the prediction confidence of its neighborhood in the subset. Therefore, we propose a novel data pruning algorithm, Prune4Rel, that finds a subset maximizing the total neighborhood confidence of all training examples, thereby maximizing the re-labeling accuracy and generalization performance. Extensive experiments on four real and one synthetic noisy datasets show that \algname{} outperforms the baselines with Re-labeling models by up to 9.1% as well as those with a standard model by up to 21.6%.
    Self-Supervised Pre-Training with Contrastive and Masked Autoencoder Methods for Dealing with Small Datasets in Deep Learning for Medical Imaging. (arXiv:2308.06534v4 [cs.CV] UPDATED)
    Deep learning in medical imaging has the potential to minimize the risk of diagnostic errors, reduce radiologist workload, and accelerate diagnosis. Training such deep learning models requires large and accurate datasets, with annotations for all training samples. However, in the medical imaging domain, annotated datasets for specific tasks are often small due to the high complexity of annotations, limited access, or the rarity of diseases. To address this challenge, deep learning models can be pre-trained on large image datasets without annotations using methods from the field of self-supervised learning. After pre-training, small annotated datasets are sufficient to fine-tune the models for a specific task. The most popular self-supervised pre-training approaches in medical imaging are based on contrastive learning. However, recent studies in natural image processing indicate a strong potential for masked autoencoder approaches. Our work compares state-of-the-art contrastive learning methods with the recently introduced masked autoencoder approach "SparK" for convolutional neural networks (CNNs) on medical images. Therefore we pre-train on a large unannotated CT image dataset and fine-tune on several CT classification tasks. Due to the challenge of obtaining sufficient annotated training data in medical imaging, it is of particular interest to evaluate how the self-supervised pre-training methods perform when fine-tuning on small datasets. By experimenting with gradually reducing the training dataset size for fine-tuning, we find that the reduction has different effects depending on the type of pre-training chosen. The SparK pre-training method is more robust to the training dataset size than the contrastive methods. Based on our results, we propose the SparK pre-training for medical imaging tasks with only small annotated datasets.
    Learned Visual Features to Textual Explanations. (arXiv:2309.00733v2 [cs.CV] UPDATED)
    Interpreting the learned features of vision models has posed a longstanding challenge in the field of machine learning. To address this issue, we propose a novel method that leverages the capabilities of large language models (LLMs) to interpret the learned features of pre-trained image classifiers. Our method, called TExplain, tackles this task by training a neural network to establish a connection between the feature space of image classifiers and LLMs. Then, during inference, our approach generates a vast number of sentences to explain the features learned by the classifier for a given image. These sentences are then used to extract the most frequent words, providing a comprehensive understanding of the learned features and patterns within the classifier. Our method, for the first time, utilizes these frequent words corresponding to a visual representation to provide insights into the decision-making process of the independently trained classifier, enabling the detection of spurious correlations, biases, and a deeper comprehension of its behavior. To validate the effectiveness of our approach, we conduct experiments on diverse datasets, including ImageNet-9L and Waterbirds. The results demonstrate the potential of our method to enhance the interpretability and robustness of image classifiers.
    Gaussian Mixture Solvers for Diffusion Models. (arXiv:2311.00941v1 [cs.LG])
    Recently, diffusion models have achieved great success in generative tasks. Sampling from diffusion models is equivalent to solving the reverse diffusion stochastic differential equations (SDEs) or the corresponding probability flow ordinary differential equations (ODEs). In comparison, SDE-based solvers can generate samples of higher quality and are suited for image translation tasks like stroke-based synthesis. During inference, however, existing SDE-based solvers are severely constrained by the efficiency-effectiveness dilemma. Our investigation suggests that this is because the Gaussian assumption in the reverse transition kernel is frequently violated (even in the case of simple mixture data) given a limited number of discretization steps. To overcome this limitation, we introduce a novel class of SDE-based solvers called \emph{Gaussian Mixture Solvers (GMS)} for diffusion models. Our solver estimates the first three-order moments and optimizes the parameters of a Gaussian mixture transition kernel using generalized methods of moments in each step during sampling. Empirically, our solver outperforms numerous SDE-based solvers in terms of sample quality in image generation and stroke-based synthesis in various diffusion models, which validates the motivation and effectiveness of GMS. Our code is available at https://github.com/Guohanzhong/GMS.
    An energy-based comparative analysis of common approaches to text classification in the Legal domain. (arXiv:2311.01256v1 [cs.CL])
    Most Machine Learning research evaluates the best solutions in terms of performance. However, in the race for the best performing model, many important aspects are often overlooked when, on the contrary, they should be carefully considered. In fact, sometimes the gaps in performance between different approaches are neglectable, whereas factors such as production costs, energy consumption, and carbon footprint must take into consideration. Large Language Models (LLMs) are extensively adopted to address NLP problems in academia and industry. In this work, we present a detailed quantitative comparison of LLM and traditional approaches (e.g. SVM) on the LexGLUE benchmark, which takes into account both performance (standard indices) and alternative metrics such as timing, power consumption and cost, in a word: the carbon-footprint. In our analysis, we considered the prototyping phase (model selection by training-validation-test iterations) and in-production phases separately, since they follow different implementation procedures and also require different resources. The results indicate that very often, the simplest algorithms achieve performance very close to that of large LLMs but with very low power consumption and lower resource demands. The results obtained could suggest companies to include additional evaluations in the choice of Machine Learning (ML) solutions.
    Invariant-Feature Subspace Recovery: A New Class of Provable Domain Generalization Algorithms. (arXiv:2311.00966v1 [cs.LG])
    Domain generalization asks for models trained over a set of training environments to generalize well in unseen test environments. Recently, a series of algorithms such as Invariant Risk Minimization (IRM) have been proposed for domain generalization. However, Rosenfeld et al. (2021) shows that in a simple linear data model, even if non-convexity issues are ignored, IRM and its extensions cannot generalize to unseen environments with less than $d_s+1$ training environments, where $d_s$ is the dimension of the spurious-feature subspace. In this work, we propose Invariant-feature Subspace Recovery (ISR): a new class of algorithms to achieve provable domain generalization across the settings of classification and regression problems. First, in the binary classification setup of Rosenfeld et al. (2021), we show that our first algorithm, ISR-Mean, can identify the subspace spanned by invariant features from the first-order moments of the class-conditional distributions, and achieve provable domain generalization with $d_s+1$ training environments. Our second algorithm, ISR-Cov, further reduces the required number of training environments to $O(1)$ using the information of second-order moments. Notably, unlike IRM, our algorithms bypass non-convexity issues and enjoy global convergence guarantees. Next, we extend ISR-Mean to the more general setting of multi-class classification and propose ISR-Multiclass, which leverages class information and provably recovers the invariant-feature subspace with $\lceil d_s/k\rceil+1$ training environments for $k$-class classification. Finally, for regression problems, we propose ISR-Regression that can identify the invariant-feature subspace with $d_s+1$ training environments. Empirically, we demonstrate the superior performance of our ISRs on synthetic benchmarks. Further, ISR can be used as post-processing methods for feature extractors such as neural nets.
    E3 TTS: Easy End-to-End Diffusion-based Text to Speech. (arXiv:2311.00945v1 [cs.SD])
    We propose Easy End-to-End Diffusion-based Text to Speech, a simple and efficient end-to-end text-to-speech model based on diffusion. E3 TTS directly takes plain text as input and generates an audio waveform through an iterative refinement process. Unlike many prior work, E3 TTS does not rely on any intermediate representations like spectrogram features or alignment information. Instead, E3 TTS models the temporal structure of the waveform through the diffusion process. Without relying on additional conditioning information, E3 TTS could support flexible latent structure within the given audio. This enables E3 TTS to be easily adapted for zero-shot tasks such as editing without any additional training. Experiments show that E3 TTS can generate high-fidelity audio, approaching the performance of a state-of-the-art neural TTS system. Audio samples are available at https://e3tts.github.io.
    A Multi-Agent Reinforcement Learning Framework for Evaluating the U.S. Ending the HIV Epidemic Plan. (arXiv:2311.00855v1 [cs.AI])
    Human immunodeficiency virus (HIV) is a major public health concern in the United States, with about 1.2 million people living with HIV and 35,000 newly infected each year. There are considerable geographical disparities in HIV burden and care access across the U.S. The 2019 Ending the HIV Epidemic (EHE) initiative aims to reduce new infections by 90% by 2030, by improving coverage of diagnoses, treatment, and prevention interventions and prioritizing jurisdictions with high HIV prevalence. Identifying optimal scale-up of intervention combinations will help inform resource allocation. Existing HIV decision analytic models either evaluate specific cities or the overall national population, thus overlooking jurisdictional interactions or differences. In this paper, we propose a multi-agent reinforcement learning (MARL) model, that enables jurisdiction-specific decision analyses but in an environment with cross-jurisdictional epidemiological interactions. In experimental analyses, conducted on jurisdictions within California and Florida, optimal policies from MARL were significantly different than those generated from single-agent RL, highlighting the influence of jurisdictional variations and interactions. By using comprehensive modeling of HIV and formulations of state space, action space, and reward functions, this work helps demonstrate the strengths and applicability of MARL for informing public health policies, and provides a framework for expanding to the national-level to inform the EHE.
    H-NeXt: The next step towards roto-translation invariant networks. (arXiv:2311.01111v1 [cs.CV])
    The widespread popularity of equivariant networks underscores the significance of parameter efficient models and effective use of training data. At a time when robustness to unseen deformations is becoming increasingly important, we present H-NeXt, which bridges the gap between equivariance and invariance. H-NeXt is a parameter-efficient roto-translation invariant network that is trained without a single augmented image in the training set. Our network comprises three components: an equivariant backbone for learning roto-translation independent features, an invariant pooling layer for discarding roto-translation information, and a classification layer. H-NeXt outperforms the state of the art in classification on unaugmented training sets and augmented test sets of MNIST and CIFAR-10.
    Batch Bayesian Optimization for Replicable Experimental Design. (arXiv:2311.01195v1 [cs.LG])
    Many real-world experimental design problems (a) evaluate multiple experimental conditions in parallel and (b) replicate each condition multiple times due to large and heteroscedastic observation noise. Given a fixed total budget, this naturally induces a trade-off between evaluating more unique conditions while replicating each of them fewer times vs. evaluating fewer unique conditions and replicating each more times. Moreover, in these problems, practitioners may be risk-averse and hence prefer an input with both good average performance and small variability. To tackle both challenges, we propose the Batch Thompson Sampling for Replicable Experimental Design (BTS-RED) framework, which encompasses three algorithms. Our BTS-RED-Known and BTS-RED-Unknown algorithms, for, respectively, known and unknown noise variance, choose the number of replications adaptively rather than deterministically such that an input with a larger noise variance is replicated more times. As a result, despite the noise heteroscedasticity, both algorithms enjoy a theoretical guarantee and are asymptotically no-regret. Our Mean-Var-BTS-RED algorithm aims at risk-averse optimization and is also asymptotically no-regret. We also show the effectiveness of our algorithms in two practical real-world applications: precision agriculture and AutoML.
    Are These the Same Apple? Comparing Images Based on Object Intrinsics. (arXiv:2311.00750v1 [cs.CV])
    The human visual system can effortlessly recognize an object under different extrinsic factors such as lighting, object poses, and background, yet current computer vision systems often struggle with these variations. An important step to understanding and improving artificial vision systems is to measure image similarity purely based on intrinsic object properties that define object identity. This problem has been studied in the computer vision literature as re-identification, though mostly restricted to specific object categories such as people and cars. We propose to extend it to general object categories, exploring an image similarity metric based on object intrinsics. To benchmark such measurements, we collect the Common paired objects Under differenT Extrinsics (CUTE) dataset of $18,000$ images of $180$ objects under different extrinsic factors such as lighting, poses, and imaging conditions. While existing methods such as LPIPS and CLIP scores do not measure object intrinsics well, we find that combining deep features learned from contrastive self-supervised learning with foreground filtering is a simple yet effective approach to approximating the similarity. We conduct an extensive survey of pre-trained features and foreground extraction methods to arrive at a strong baseline that best measures intrinsic object-centric image similarity among current methods. Finally, we demonstrate that our approach can aid in downstream applications such as acting as an analog for human subjects and improving generalizable re-identification. Please see our project website at https://s-tian.github.io/projects/cute/ for visualizations of the data and demos of our metric.
    Application and Energy-Aware Data Aggregation using Vector Synchronization in Distributed Battery-less IoT Networks. (arXiv:2311.01050v1 [cs.NI])
    The battery-less Internet of Things (IoT) devices are a key element in the sustainable green initiative for the next-generation wireless networks. These battery-free devices use the ambient energy, harvested from the environment. The energy harvesting environment is dynamic and causes intermittent task execution. The harvested energy is stored in small capacitors and it is challenging to assure the application task execution. The main goal is to provide a mechanism to aggregate the sensor data and provide a sustainable application support in the distributed battery-less IoT network. We model the distributed IoT network system consisting of many battery-free IoT sensor hardware modules and heterogeneous IoT applications that are being supported in the device-edge-cloud continuum. The applications require sensor data from a distributed set of battery-less hardware modules and there is provision of joint control over the module actuators. We propose an application-aware task and energy manager (ATEM) for the IoT devices and a vector-synchronization based data aggregator (VSDA). The ATEM is supported by device-level federated energy harvesting and system-level energy-aware heterogeneous application management. In our proposed framework the data aggregator forecasts the available power from the ambient energy harvester using long-short-term-memory (LSTM) model and sets the device profile as well as the application task rates accordingly. Our proposed scheme meets the heterogeneous application requirements with negligible overhead; reduces the data loss and packet delay; increases the hardware component availability; and makes the components available sooner as compared to the state-of-the-art.
    Vision-Language Foundation Models as Effective Robot Imitators. (arXiv:2311.01378v1 [cs.RO])
    Recent progress in vision language foundation models has shown their ability to understand multimodal data and resolve complicated vision language tasks, including robotics manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLMs, OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is slightly fine-tuned by imitation learning only on language-conditioned manipulation datasets. Such a decomposition provides RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding the state-of-the-art performance with a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative to adapt VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. We believe RoboFlamingo has the potential to be a cost-effective and easy-to-use solution for robotics manipulation, empowering everyone with the ability to fine-tune their own robotics policy.
    Stochastic Smoothed Gradient Descent Ascent for Federated Minimax Optimization. (arXiv:2311.00944v1 [stat.ML])
    In recent years, federated minimax optimization has attracted growing interest due to its extensive applications in various machine learning tasks. While Smoothed Alternative Gradient Descent Ascent (Smoothed-AGDA) has proved its success in centralized nonconvex minimax optimization, how and whether smoothing technique could be helpful in federated setting remains unexplored. In this paper, we propose a new algorithm termed Federated Stochastic Smoothed Gradient Descent Ascent (FESS-GDA), which utilizes the smoothing technique for federated minimax optimization. We prove that FESS-GDA can be uniformly used to solve several classes of federated minimax problems and prove new or better analytical convergence results for these settings. We showcase the practical efficiency of FESS-GDA in practical federated learning tasks of training generative adversarial networks (GANs) and fair classification.
    Conformal Prediction for Time Series with Modern Hopfield Networks. (arXiv:2303.12783v2 [cs.LG] UPDATED)
    To quantify uncertainty, conformal prediction methods are gaining continuously more interest and have already been successfully applied to various domains. However, they are difficult to apply to time series as the autocorrelative structure of time series violates basic assumptions required by conformal prediction. We propose HopCPT, a novel conformal prediction approach for time series that not only copes with temporal structures but leverages them. We show that our approach is theoretically well justified for time series where temporal dependencies are present. In experiments, we demonstrate that our new approach outperforms state-of-the-art conformal prediction methods on multiple real-world time series datasets from four different domains.
    Deep Learning for real-time neural decoding of grasp. (arXiv:2311.01061v1 [cs.LG])
    Neural decoding involves correlating signals acquired from the brain to variables in the physical world like limb movement or robot control in Brain Machine Interfaces. In this context, this work starts from a specific pre-existing dataset of neural recordings from monkey motor cortex and presents a Deep Learning-based approach to the decoding of neural signals for grasp type classification. Specifically, we propose here an approach that exploits LSTM networks to classify time series containing neural data (i.e., spike trains) into classes representing the object being grasped. The main goal of the presented approach is to improve over state-of-the-art decoding accuracy without relying on any prior neuroscience knowledge, and leveraging only the capability of deep learning models to extract correlations from data. The paper presents the results achieved for the considered dataset and compares them with previous works on the same dataset, showing a significant improvement in classification accuracy, even if considering simulated real-time decoding.
    A Study of Continual Learning Under Language Shift. (arXiv:2311.01200v1 [cs.CL])
    The recent increase in data and model scale for language model pre-training has led to huge training costs. In scenarios where new data become available over time, updating a model instead of fully retraining it would therefore provide significant gains. In this paper, we study the benefits and downsides of updating a language model when new data comes from new languages - the case of continual learning under language shift. Starting from a monolingual English language model, we incrementally add data from Norwegian and Icelandic to investigate how forward and backward transfer effects depend on the pre-training order and characteristics of languages, for different model sizes and learning rate schedulers. Our results show that, while forward transfer is largely positive and independent of language order, backward transfer can be either positive or negative depending on the order and characteristics of new languages. To explain these patterns we explore several language similarity metrics and find that syntactic similarity appears to have the best correlation with our results.
    Respiratory Anomaly Detection using Reflected Infrared Light-wave Signals. (arXiv:2311.01367v1 [eess.SP])
    In this study, we present a non-contact respiratory anomaly detection method using incoherent light-wave signals reflected from the chest of a mechanical robot that can breathe like human beings. In comparison to existing radar and camera-based sensing systems for vitals monitoring, this technology uses only a low-cost ubiquitous light source (e.g., infrared light emitting diode) and sensor (e.g., photodetector). This light-wave sensing (LWS) system recognizes different breathing anomalies from the variations of light intensity reflected from the chest of the robot within a 0.5m-1.5m range. The anomaly detection model demonstrates up to 96.6% average accuracy in classifying 7 different types of breathing data using machine learning. The model can also detect faulty data collected by the system that does not contain breathing information. The developed system can be utilized at home or healthcare facilities as a smart, non-contact and discreet respiration monitoring method.
    SIESTA: Efficient Online Continual Learning with Sleep. (arXiv:2303.10725v3 [cs.CV] UPDATED)
    In supervised continual learning, a deep neural network (DNN) is updated with an ever-growing data stream. Unlike the offline setting where data is shuffled, we cannot make any distributional assumptions about the data stream. Ideally, only one pass through the dataset is needed for computational efficiency. However, existing methods are inadequate and make many assumptions that cannot be made for real-world applications, while simultaneously failing to improve computational efficiency. In this paper, we propose a novel continual learning method, SIESTA based on wake/sleep framework for training, which is well aligned to the needs of on-device learning. The major goal of SIESTA is to advance compute efficient continual learning so that DNNs can be updated efficiently using far less time and energy. The principal innovations of SIESTA are: 1) rapid online updates using a rehearsal-free, backpropagation-free, and data-driven network update rule during its wake phase, and 2) expedited memory consolidation using a compute-restricted rehearsal policy during its sleep phase. For memory efficiency, SIESTA adapts latent rehearsal using memory indexing from REMIND. Compared to REMIND and prior arts, SIESTA is far more computationally efficient, enabling continual learning on ImageNet-1K in under 2 hours on a single GPU; moreover, in the augmentation-free setting it matches the performance of the offline learner, a milestone critical to driving adoption of continual learning in real-world applications.
    Training Dynamics of Contextual N-Grams in Language Models. (arXiv:2311.00863v1 [cs.LG])
    Prior work has shown the existence of contextual neurons in language models, including a neuron that activates on German text. We show that this neuron exists within a broader contextual n-gram circuit: we find late layer neurons which recognize and continue n-grams common in German text, but which only activate if the German neuron is active. We investigate the formation of this circuit throughout training and find that it is an example of what we call a second-order circuit. In particular, both the constituent n-gram circuits and the German detection circuit which culminates in the German neuron form with independent functions early in training - the German detection circuit partially through modeling German unigram statistics, and the n-grams by boosting appropriate completions. Only after both circuits have already formed do they fit together into a second-order circuit. Contrary to the hypotheses presented in prior work, we find that the contextual n-gram circuit forms gradually rather than in a sudden phase transition. We further present a range of anomalous observations such as a simultaneous phase transition in many tasks coinciding with the learning rate warm-up, and evidence that many context neurons form simultaneously early in training but are later unlearned.
    A Review and Roadmap of Deep Causal Model from Different Causal Structures and Representations. (arXiv:2311.00923v1 [cs.LG])
    The fusion of causal models with deep learning introducing increasingly intricate data sets, such as the causal associations within images or between textual components, has surfaced as a focal research area. Nonetheless, the broadening of original causal concepts and theories to such complex, non-statistical data has been met with serious challenges. In response, our study proposes redefinitions of causal data into three distinct categories from the standpoint of causal structure and representation: definite data, semi-definite data, and indefinite data. Definite data chiefly pertains to statistical data used in conventional causal scenarios, while semi-definite data refers to a spectrum of data formats germane to deep learning, including time-series, images, text, and others. Indefinite data is an emergent research sphere inferred from the progression of data forms by us. To comprehensively present these three data paradigms, we elaborate on their formal definitions, differences manifested in datasets, resolution pathways, and development of research. We summarize key tasks and achievements pertaining to definite and semi-definite data from myriad research undertakings, present a roadmap for indefinite data, beginning with its current research conundrums. Lastly, we classify and scrutinize the key datasets presently utilized within these three paradigms.
    Monotone Generative Modeling via a Gromov-Monge Embedding. (arXiv:2311.01375v1 [cs.LG])
    Generative Adversarial Networks (GANs) are powerful tools for creating new content, but they face challenges such as sensitivity to starting conditions and mode collapse. To address these issues, we propose a deep generative model that utilizes the Gromov-Monge embedding (GME). It helps identify the low-dimensional structure of the underlying measure of the data and then maps it, while preserving its geometry, into a measure in a low-dimensional latent space, which is then optimally transported to the reference measure. We guarantee the preservation of the underlying geometry by the GME and $c$-cyclical monotonicity of the generative map, where $c$ is an intrinsic embedding cost employed by the GME. The latter property is a first step in guaranteeing better robustness to initialization of parameters and mode collapse. Numerical experiments demonstrate the effectiveness of our approach in generating high-quality images, avoiding mode collapse, and exhibiting robustness to different starting conditions.
    Bounding Wasserstein distance with couplings. (arXiv:2112.03152v3 [stat.CO] UPDATED)
    Markov chain Monte Carlo (MCMC) provides asymptotically consistent estimates of intractable posterior expectations as the number of iterations tends to infinity. However, in large data applications, MCMC can be computationally expensive per iteration. This has catalyzed interest in approximating MCMC in a manner that improves computational speed per iteration but does not produce asymptotically consistent estimates. In this article, we propose estimators based on couplings of Markov chains to assess the quality of such asymptotically biased sampling methods. The estimators give empirical upper bounds of the Wasserstein distance between the limiting distribution of the asymptotically biased sampling method and the original target distribution of interest. We establish theoretical guarantees for our upper bounds and show that our estimators can remain effective in high dimensions. We apply our quality measures to stochastic gradient MCMC, variational Bayes, and Laplace approximations for tall data and to approximate MCMC for Bayesian logistic regression in 4500 dimensions and Bayesian linear regression in 50000 dimensions.
    A Review of Digital Twins and their Application in Cybersecurity based on Artificial Intelligence. (arXiv:2311.01154v1 [cs.CR])
    The potential of digital twin technology is yet to be fully realized due to its diversity and untapped potential. Digital twins enable systems' analysis, design, optimization, and evolution to be performed digitally or in conjunction with a cyber-physical approach to improve speed, accuracy, and efficiency over traditional engineering methods. Industry 4.0, factories of the future, and digital twins continue to benefit from the technology and provide enhanced efficiency within existing systems. Due to the lack of information and security standards associated with the transition to cyber digitization, cybercriminals have been able to take advantage of the situation. Access to a digital twin of a product or service is equivalent to threatening the entire collection. There is a robust interaction between digital twins and artificial intelligence tools, which leads to strong interaction between these technologies, so it can be used to improve the cybersecurity of these digital platforms based on their integration with these technologies. This study aims to investigate the role of artificial intelligence in providing cybersecurity for digital twin versions of various industries, as well as the risks associated with these versions. In addition, this research serves as a road map for researchers and others interested in cybersecurity and digital security.
    Federated Learning on Edge Sensing Devices: A Review. (arXiv:2311.01201v1 [cs.LG])
    The ability to monitor ambient characteristics, interact with them, and derive information about the surroundings has been made possible by the rapid proliferation of edge sensing devices like IoT, mobile, and wearable devices and their measuring capabilities with integrated sensors. Even though these devices are small and have less capacity for data storage and processing, they produce vast amounts of data. Some example application areas where sensor data is collected and processed include healthcare, environmental (including air quality and pollution levels), automotive, industrial, aerospace, and agricultural applications. These enormous volumes of sensing data collected from the edge devices are analyzed using a variety of Machine Learning (ML) and Deep Learning (DL) approaches. However, analyzing them on the cloud or a server presents challenges related to privacy, hardware, and connectivity limitations. Federated Learning (FL) is emerging as a solution to these problems while preserving privacy by jointly training a model without sharing raw data. In this paper, we review the FL strategies from the perspective of edge sensing devices to get over the limitations of conventional machine learning techniques. We focus on the key FL principles, software frameworks, and testbeds. We also explore the current sensor technologies, properties of the sensing devices and sensing applications where FL is utilized. We conclude with a discussion on open issues and future research directions on FL for further studies
    Diffusion Models for Reinforcement Learning: A Survey. (arXiv:2311.01223v1 [cs.LG])
    Diffusion models have emerged as a prominent class of generative models, surpassing previous methods regarding sample quality and training stability. Recent works have shown the advantages of diffusion models in improving reinforcement learning (RL) solutions, including as trajectory planners, expressive policy classes, data synthesizers, etc. This survey aims to provide an overview of the advancements in this emerging field and hopes to inspire new avenues of research. First, we examine several challenges encountered by current RL algorithms. Then, we present a taxonomy of existing methods based on the roles played by diffusion models in RL and explore how the existing challenges are addressed. We further outline successful applications of diffusion models in various RL-related tasks while discussing the limitations of current approaches. Finally, we conclude the survey and offer insights into future research directions, focusing on enhancing model performance and applying diffusion models to broader tasks. We are actively maintaining a GitHub repository for papers and other related resources in applying diffusion models in RL: https://github.com/apexrl/Diff4RLSurvey .
    Releasing Graph Neural Networks with Differential Privacy Guarantees. (arXiv:2109.08907v2 [cs.LG] UPDATED)
    With the increasing popularity of graph neural networks (GNNs) in several sensitive applications like healthcare and medicine, concerns have been raised over the privacy aspects of trained GNNs. More notably, GNNs are vulnerable to privacy attacks, such as membership inference attacks, even if only black-box access to the trained model is granted. We propose PrivGNN, a privacy-preserving framework for releasing GNN models in a centralized setting. Assuming an access to a public unlabeled graph, PrivGNN provides a framework to release GNN models trained explicitly on public data along with knowledge obtained from the private data in a privacy preserving manner. PrivGNN combines the knowledge-distillation framework with the two noise mechanisms, random subsampling, and noisy labeling, to ensure rigorous privacy guarantees. We theoretically analyze our approach in the Renyi differential privacy framework. Besides, we show the solid experimental performance of our method compared to several baselines adapted for graph-structured data. Our code is available at https://github.com/iyempissy/privGnn.
    Score-based Data Assimilation for a Two-Layer Quasi-Geostrophic Model. (arXiv:2310.01853v2 [stat.ML] UPDATED)
    Data assimilation addresses the problem of identifying plausible state trajectories of dynamical systems given noisy or incomplete observations. In geosciences, it presents challenges due to the high-dimensionality of geophysical dynamical systems, often exceeding millions of dimensions. This work assesses the scalability of score-based data assimilation (SDA), a novel data assimilation method, in the context of such systems. We propose modifications to the score network architecture aimed at significantly reducing memory consumption and execution time. We demonstrate promising results for a two-layer quasi-geostrophic model.
    An Integrated Framework Integrating Monte Carlo Tree Search and Supervised Learning for Train Timetabling Problem. (arXiv:2311.00971v1 [cs.LG])
    The single-track railway train timetabling problem (TTP) is an important and complex problem. This article proposes an integrated Monte Carlo Tree Search (MCTS) computing framework that combines heuristic methods, unsupervised learning methods, and supervised learning methods for solving TTP in discrete action spaces. This article first describes the mathematical model and simulation system dynamics of TTP, analyzes the characteristics of the solution from the perspective of MCTS, and proposes some heuristic methods to improve MCTS. This article considers these methods as planners in the proposed framework. Secondly, this article utilizes deep convolutional neural networks to approximate the value of nodes and further applies them to the MCTS search process, referred to as learners. The experiment shows that the proposed heuristic MCTS method is beneficial for solving TTP; The algorithm framework that integrates planners and learners can improve the data efficiency of solving TTP; The proposed method provides a new paradigm for solving TTP.
    Selectively Sharing Experiences Improves Multi-Agent Reinforcement Learning. (arXiv:2311.00865v1 [cs.LG])
    We present a novel multi-agent RL approach, Selective Multi-Agent Prioritized Experience Relay, in which agents share with other agents a limited number of transitions they observe during training. The intuition behind this is that even a small number of relevant experiences from other agents could help each agent learn. Unlike many other multi-agent RL algorithms, this approach allows for largely decentralized training, requiring only a limited communication channel between agents. We show that our approach outperforms baseline no-sharing decentralized training and state-of-the art multi-agent RL algorithms. Further, sharing only a small number of highly relevant experiences outperforms sharing all experiences between agents, and the performance uplift from selective experience sharing is robust across a range of hyperparameters and DQN variants. A reference implementation of our algorithm is available at https://github.com/mgerstgrasser/super.
    Representing Edge Flows on Graphs via Sparse Cell Complexes. (arXiv:2309.01632v3 [cs.SI] UPDATED)
    Obtaining sparse, interpretable representations of observable data is crucial in many machine learning and signal processing tasks. For data representing flows along the edges of a graph, an intuitively interpretable way to obtain such representations is to lift the graph structure to a simplicial complex: The eigenvectors of the associated Hodge-Laplacian, respectively the incidence matrices of the corresponding simplicial complex then induce a Hodge decomposition, which can be used to represent the observed data in terms of gradient, curl, and harmonic flows. In this paper, we generalize this approach to cellular complexes and introduce the flow representation learning problem, i.e., the problem of augmenting the observed graph by a set of cells, such that the eigenvectors of the associated Hodge Laplacian provide a sparse, interpretable representation of the observed edge flows on the graph. We show that this problem is NP-hard and introduce an efficient approximation algorithm for its solution. Experiments on real-world and synthetic data demonstrate that our algorithm outperforms state-of-the-art methods with respect to approximation error, while being computationally efficient.
    PET Tracer Conversion among Brain PET via Variable Augmented Invertible Network. (arXiv:2311.00735v1 [cs.LG])
    Positron emission tomography (PET), as an imaging technique with high biochemical sensitivity, has been widely used in diagnosis of encephalopathy and brain science research used in brain disease diagnosis and brain science research. Since different tracers present different effects on the same focal area, the choice of tracers is getting more significant for PET imaging. Nowadays, with the wide application of PET imaging in neuropsychiatric treatment, 6-18F-fluoro-3, 4-dihydroxy-L-phenylalanine (DOPA) has been found to be more effective than 18F-labeled fluorine-2-deoxyglucose (FDG) in this field. However, due to the complexity of its preparation and other limitations, DOPA is far less widely used than FDG. To address this issue, a tracer conversion invertible neural network (TC-INN) for image projection is developed to map FDG images to DOPA images through deep learning. More diagnostic information is obtained by generating PET images from FDG to DOPA. Specifically, the proposed TC-INN consists of two separate phases, one for training the traceable data, the other for re-building the new data. The reference DOPA PET image is used as the learning target for the corresponding network during the training process of tracer conversion. Mean-while, the invertible network iteratively estimates the resultant DOPA PET data and compares it to the reference DOPA PET data. Notably, the reversible model employed variable enhancement techniques to achieve better power generation. Moreover, image registration needs to be performed before training due to the angular deviation of the acquired FDG and DOPA data information. Experimental results show generative ability in mapping be-tween FDG images and DOPA images. It demonstrates great potential for PET image conversion in the case of limited tracer applications.
    VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization. (arXiv:2311.00807v1 [cs.CV])
    Visual question answering (VQA) models are designed to demonstrate visual-textual reasoning capabilities. However, their real-world applicability is hindered by a lack of comprehensive benchmark datasets. Existing domain generalization datasets for VQA exhibit a unilateral focus on textual shifts while VQA being a multi-modal task contains shifts across both visual and textual domains. We propose VQA-GEN, the first ever multi-modal benchmark dataset for distribution shift generated through a shift induced pipeline. Experiments demonstrate VQA-GEN dataset exposes the vulnerability of existing methods to joint multi-modal distribution shifts. validating that comprehensive multi-modal shifts are critical for robust VQA generalization. Models trained on VQA-GEN exhibit improved cross-domain and in-domain performance, confirming the value of VQA-GEN. Further, we analyze the importance of each shift technique of our pipeline contributing to the generalization of the model.
    MIST: Defending Against Membership Inference Attacks Through Membership-Invariant Subspace Training. (arXiv:2311.00919v1 [cs.CR])
    In Member Inference (MI) attacks, the adversary try to determine whether an instance is used to train a machine learning (ML) model. MI attacks are a major privacy concern when using private data to train ML models. Most MI attacks in the literature take advantage of the fact that ML models are trained to fit the training data well, and thus have very low loss on training instances. Most defenses against MI attacks therefore try to make the model fit the training data less well. Doing so, however, generally results in lower accuracy. We observe that training instances have different degrees of vulnerability to MI attacks. Most instances will have low loss even when not included in training. For these instances, the model can fit them well without concerns of MI attacks. An effective defense only needs to (possibly implicitly) identify instances that are vulnerable to MI attacks and avoids overfitting them. A major challenge is how to achieve such an effect in an efficient training process. Leveraging two distinct recent advancements in representation learning: counterfactually-invariant representations and subspace learning methods, we introduce a novel Membership-Invariant Subspace Training (MIST) method to defend against MI attacks. MIST avoids overfitting the vulnerable instances without significant impact on other instances. We have conducted extensive experimental studies, comparing MIST with various other state-of-the-art (SOTA) MI defenses against several SOTA MI attacks. We find that MIST outperforms other defenses while resulting in minimal reduction in testing accuracy.
    Federated Linear Bandits with Finite Adversarial Actions. (arXiv:2311.00973v1 [cs.LG])
    We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem with finite adversarial action sets that may be different across clients. To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm, which extends the principles of SupLinUCB and OFUL algorithms in linear contextual bandits. We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{d T})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model. This matches the minimax lower bound and thus is order-optimal (up to polylog terms). We study both asynchronous and synchronous cases and show that the communication cost can be controlled as $O(d M^2 \log(d)\log(T))$ and $O(\sqrt{d^3 M^3} \log(d))$, respectively. The FedSupLinUCB design is further extended to two scenarios: (1) variance-adaptive, where a total regret of $\tilde{O} (\sqrt{d \sum \nolimits_{t=1}^{T} \sigma_t^2})$ can be achieved with $\sigma_t^2$ being the noise variance of round $t$; and (2) adversarial corruption, where a total regret of $\tilde{O}(\sqrt{dT} + d C_p)$ can be achieved with $C_p$ being the total corruption budget. Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of FedSupLinUCB on both synthetic and real-world datasets.
    Autonomous Learning of Generative Models with Chemical Reaction Network Ensembles. (arXiv:2311.00975v1 [q-bio.MN])
    Can a micron sized sack of interacting molecules autonomously learn an internal model of a complex and fluctuating environment? We draw insights from control theory, machine learning theory, chemical reaction network theory, and statistical physics to develop a general architecture whereby a broad class of chemical systems can autonomously learn complex distributions. Our construction takes the form of a chemical implementation of machine learning's optimization workhorse: gradient descent on the relative entropy cost function. We show how this method can be applied to optimize any detailed balanced chemical reaction network and that the construction is capable of using hidden units to learn complex distributions. This result is then recast as a form of integral feedback control. Finally, due to our use of an explicit physical model of learning, we are able to derive thermodynamic costs and trade-offs associated to this process.
    Better with Less: A Data-Active Perspective on Pre-Training Graph Neural Networks. (arXiv:2311.01038v1 [cs.LG])
    Pre-training on graph neural networks (GNNs) aims to learn transferable knowledge for downstream tasks with unlabeled data, and it has recently become an active research area. The success of graph pre-training models is often attributed to the massive amount of input data. In this paper, however, we identify the curse of big data phenomenon in graph pre-training: more training data do not necessarily lead to better downstream performance. Motivated by this observation, we propose a better-with-less framework for graph pre-training: fewer, but carefully chosen data are fed into a GNN model to enhance pre-training. The proposed pre-training pipeline is called the data-active graph pre-training (APT) framework, and is composed of a graph selector and a pre-training model. The graph selector chooses the most representative and instructive data points based on the inherent properties of graphs as well as predictive uncertainty. The proposed predictive uncertainty, as feedback from the pre-training model, measures the confidence level of the model in the data. When fed with the chosen data, on the other hand, the pre-training model grasps an initial understanding of the new, unseen data, and at the same time attempts to remember the knowledge learned from previous data. Therefore, the integration and interaction between these two components form a unified framework (APT), in which graph pre-training is performed in a progressive and iterative way. Experiment results show that the proposed APT is able to obtain an efficient pre-training model with fewer training data and better downstream performance.
    A Coreset-based, Tempered Variational Posterior for Accurate and Scalable Stochastic Gaussian Process Inference. (arXiv:2311.01409v1 [cs.LG])
    We present a novel stochastic variational Gaussian process ($\mathcal{GP}$) inference method, based on a posterior over a learnable set of weighted pseudo input-output points (coresets). Instead of a free-form variational family, the proposed coreset-based, variational tempered family for $\mathcal{GP}$s (CVTGP) is defined in terms of the $\mathcal{GP}$ prior and the data-likelihood; hence, accommodating the modeling inductive biases. We derive CVTGP's lower bound for the log-marginal likelihood via marginalization of the proposed posterior over latent $\mathcal{GP}$ coreset variables, and show it is amenable to stochastic optimization. CVTGP reduces the learnable parameter size to $\mathcal{O}(M)$, enjoys numerical stability, and maintains $\mathcal{O}(M^3)$ time- and $\mathcal{O}(M^2)$ space-complexity, by leveraging a coreset-based tempered posterior that, in turn, provides sparse and explainable representations of the data. Results on simulated and real-world regression problems with Gaussian observation noise validate that CVTGP provides better evidence lower-bound estimates and predictive root mean squared error than alternative stochastic $\mathcal{GP}$ inference methods.
    Getting aligned on representational alignment. (arXiv:2310.13018v2 [q-bio.NC] UPDATED)
    Biological and artificial information processing systems form representations that they can use to categorize, reason, plan, navigate, and make decisions. How can we measure the extent to which the representations formed by these diverse systems agree? Do similarities in representations then translate into similar behavior? How can a system's representations be modified to better match those of another system? These questions pertaining to the study of representational alignment are at the heart of some of the most active research areas in cognitive science, neuroscience, and machine learning. For example, cognitive scientists measure the representational alignment of multiple individuals to identify shared cognitive priors, neuroscientists align fMRI responses from multiple individuals into a shared representational space for group-level analyses, and ML researchers distill knowledge from teacher models into student models by increasing their alignment. Unfortunately, there is limited knowledge transfer between research communities interested in representational alignment, so progress in one field often ends up being rediscovered independently in another. Thus, greater cross-field communication would be advantageous. To improve communication between these fields, we propose a unifying framework that can serve as a common language between researchers studying representational alignment. We survey the literature from all three fields and demonstrate how prior work fits into this framework. Finally, we lay out open problems in representational alignment where progress can benefit all three of these fields. We hope that our work can catalyze cross-disciplinary collaboration and accelerate progress for all communities studying and developing information processing systems. We note that this is a working paper and encourage readers to reach out with their suggestions for future revisions.
    Understanding and Improving Ensemble Adversarial Defense. (arXiv:2310.18477v2 [cs.LG] UPDATED)
    The strategy of ensemble has become popular in adversarial defense, which trains multiple base classifiers to defend against adversarial attacks in a cooperative manner. Despite the empirical success, theoretical explanations on why an ensemble of adversarially trained classifiers is more robust than single ones remain unclear. To fill in this gap, we develop a new error theory dedicated to understanding ensemble adversarial defense, demonstrating a provable 0-1 loss reduction on challenging sample sets in an adversarial defense scenario. Guided by this theory, we propose an effective approach to improve ensemble adversarial defense, named interactive global adversarial training (iGAT). The proposal includes (1) a probabilistic distributing rule that selectively allocates to different base classifiers adversarial examples that are globally challenging to the ensemble, and (2) a regularization term to rescue the severest weaknesses of the base classifiers. Being tested over various existing ensemble adversarial defense techniques, iGAT is capable of boosting their performance by increases up to 17% evaluated using CIFAR10 and CIFAR100 datasets under both white-box and black-box attacks.
    Bridging the Gap: Addressing Discrepancies in Diffusion Model Training for Classifier-Free Guidance. (arXiv:2311.00938v1 [cs.LG])
    Diffusion models have emerged as a pivotal advancement in generative models, setting new standards to the quality of the generated instances. In the current paper we aim to underscore a discrepancy between conventional training methods and the desired conditional sampling behavior of these models. While the prevalent classifier-free guidance technique works well, it's not without flaws. At higher values for the guidance scale parameter $w$, we often get out of distribution samples and mode collapse, whereas at lower values for $w$ we may not get the desired specificity. To address these challenges, we introduce an updated loss function that better aligns training objectives with sampling behaviors. Experimental validation with FID scores on CIFAR-10 elucidates our method's ability to produce higher quality samples with fewer sampling timesteps, and be more robust to the choice of guidance scale $w$. We also experiment with fine-tuning Stable Diffusion on the proposed loss, to provide early evidence that large diffusion models may also benefit from this refined loss function.
    KP-RNN: A Deep Learning Pipeline for Human Motion Prediction and Synthesis of Performance Art. (arXiv:2210.04366v3 [cs.CV] UPDATED)
    Digitally synthesizing human motion is an inherently complex process, which can create obstacles in application areas such as virtual reality. We offer a new approach for predicting human motion, KP-RNN, a neural network which can integrate easily with existing image processing and generation pipelines. We utilize a new human motion dataset of performance art, Take The Lead, as well as the motion generation pipeline, the Everybody Dance Now system, to demonstrate the effectiveness of KP-RNN's motion predictions. We have found that our neural network can predict human dance movements effectively, which serves as a baseline result for future works using the Take The Lead dataset. Since KP-RNN can work alongside a system such as Everybody Dance Now, we argue that our approach could inspire new methods for rendering human avatar animation. This work also serves to benefit the visualization of performance art in digital platforms by utilizing accessible neural networks.
    Manifold-augmented Eikonal Equations: Geodesic Distances and Flows on Differentiable Manifolds. (arXiv:2310.06157v2 [cs.CG] UPDATED)
    Manifolds discovered by machine learning models provide a compact representation of the underlying data. Geodesics on these manifolds define locally length-minimising curves and provide a notion of distance, which are key for reduced-order modelling, statistical inference, and interpolation. In this work, we propose a model-based parameterisation for distance fields and geodesic flows on manifolds, exploiting solutions of a manifold-augmented Eikonal equation. We demonstrate how the geometry of the manifold impacts the distance field, and exploit the geodesic flow to obtain globally length-minimising curves directly. This work opens opportunities for statistics and reduced-order modelling on differentiable manifolds.
    Scalable Probabilistic Forecasting in Retail with Gradient Boosted Trees: A Practitioner's Approach. (arXiv:2311.00993v1 [cs.LG])
    The recent M5 competition has advanced the state-of-the-art in retail forecasting. However, we notice important differences between the competition challenge and the challenges we face in a large e-commerce company. The datasets in our scenario are larger (hundreds of thousands of time series), and e-commerce can afford to have a larger assortment than brick-and-mortar retailers, leading to more intermittent data. To scale to larger dataset sizes with feasible computational effort, firstly, we investigate a two-layer hierarchy and propose a top-down approach to forecasting at an aggregated level with less amount of series and intermittency, and then disaggregating to obtain the decision-level forecasts. Probabilistic forecasts are generated under distributional assumptions. Secondly, direct training at the lower level with subsamples can also be an alternative way of scaling. Performance of modelling with subsets is evaluated with the main dataset. Apart from a proprietary dataset, the proposed scalable methods are evaluated using the Favorita dataset and the M5 dataset. We are able to show the differences in characteristics of the e-commerce and brick-and-mortar retail datasets. Notably, our top-down forecasting framework enters the top 50 of the original M5 competition, even with models trained at a higher level under a much simpler setting.
    Wasserstein Quantum Monte Carlo: A Novel Approach for Solving the Quantum Many-Body Schr\"odinger Equation. (arXiv:2307.07050v3 [physics.comp-ph] UPDATED)
    Solving the quantum many-body Schr\"odinger equation is a fundamental and challenging problem in the fields of quantum physics, quantum chemistry, and material sciences. One of the common computational approaches to this problem is Quantum Variational Monte Carlo (QVMC), in which ground-state solutions are obtained by minimizing the energy of the system within a restricted family of parameterized wave functions. Deep learning methods partially address the limitations of traditional QVMC by representing a rich family of wave functions in terms of neural networks. However, the optimization objective in QVMC remains notoriously hard to minimize and requires second-order optimization methods such as natural gradient. In this paper, we first reformulate energy functional minimization in the space of Born distributions corresponding to particle-permutation (anti-)symmetric wave functions, rather than the space of wave functions. We then interpret QVMC as the Fisher-Rao gradient flow in this distributional space, followed by a projection step onto the variational manifold. This perspective provides us with a principled framework to derive new QMC algorithms, by endowing the distributional space with better metrics, and following the projected gradient flow induced by those metrics. More specifically, we propose "Wasserstein Quantum Monte Carlo" (WQMC), which uses the gradient flow induced by the Wasserstein metric, rather than Fisher-Rao metric, and corresponds to transporting the probability mass, rather than teleporting it. We demonstrate empirically that the dynamics of WQMC results in faster convergence to the ground state of molecular systems.
    Sharp Noisy Binary Search with Monotonic Probabilities. (arXiv:2311.00840v1 [cs.DS])
    We revisit the noisy binary search model of Karp and Kleinberg, in which we have $n$ coins with unknown probabilities $p_i$ that we can flip. The coins are sorted by increasing $p_i$, and we would like to find where the probability crosses (to within $\varepsilon$) of a target value $\tau$. This generalized the fixed-noise model of Burnashev and Zigangirov , in which $p_i = \frac{1}{2} \pm \varepsilon$, to a setting where coins near the target may be indistinguishable from it. Karp and Kleinberg showed that $\Theta(\frac{1}{\varepsilon^2} \log n)$ samples are necessary and sufficient for this task. We produce a practical algorithm by solving two theoretical challenges: high-probability behavior and sharp constants. We give an algorithm that succeeds with probability $1-\delta$ from \[ \frac{1}{C_{\tau, \varepsilon}} \cdot \left(\lg n + O(\log^{2/3} n \log^{1/3} \frac{1}{\delta} + \log \frac{1}{\delta})\right) \] samples, where $C_{\tau, \varepsilon}$ is the optimal such constant achievable. For $\delta > n^{-o(1)}$ this is within $1 + o(1)$ of optimal, and for $\delta \ll 1$ it is the first bound within constant factors of optimal.
    LocoGAN -- Locally Convolutional GAN. (arXiv:2002.07897v2 [eess.IV] UPDATED)
    In the paper we construct a fully convolutional GAN model: LocoGAN, which latent space is given by noise-like images of possibly different resolutions. The learning is local, i.e. we process not the whole noise-like image, but the sub-images of a fixed size. As a consequence LocoGAN can produce images of arbitrary dimensions e.g. LSUN bedroom data set. Another advantage of our approach comes from the fact that we use the position channels, which allows the generation of fully periodic (e.g. cylindrical panoramic images) or almost periodic ,,infinitely long" images (e.g. wall-papers).
    Deception Game: Closing the Safety-Learning Loop in Interactive Robot Autonomy. (arXiv:2309.01267v2 [cs.RO] UPDATED)
    An outstanding challenge for the widespread deployment of robotic systems like autonomous vehicles is ensuring safe interaction with humans without sacrificing performance. Existing safety methods often neglect the robot's ability to learn and adapt at runtime, leading to overly conservative behavior. This paper proposes a new closed-loop paradigm for synthesizing safe control policies that explicitly account for the robot's evolving uncertainty and its ability to quickly respond to future scenarios as they arise, by jointly considering the physical dynamics and the robot's learning algorithm. We leverage adversarial reinforcement learning for tractable safety analysis under high-dimensional learning dynamics and demonstrate our framework's ability to work with both Bayesian belief propagation and implicit learning through large pre-trained neural trajectory predictors.
    Inversion of Bayesian Networks. (arXiv:2212.10649v2 [cs.LG] UPDATED)
    Variational autoencoders and Helmholtz machines use a recognition network (encoder) to approximate the posterior distribution of a generative model (decoder). In this paper we study the necessary and sufficient properties of a recognition network so that it can model the true posterior distribution exactly. These results are derived in the general context of probabilistic graphical modelling / Bayesian networks, for which the network represents a set of conditional independence statements. We derive both global conditions, in terms of d-separation, and local conditions for the recognition network to have the desired qualities. It turns out that for the local conditions the property perfectness (for every node, all parents are joined) plays an important role.
    Parting with Misconceptions about Learning-based Vehicle Motion Planning. (arXiv:2306.07962v2 [cs.RO] UPDATED)
    The release of nuPlan marks a new era in vehicle motion planning research, offering the first large-scale real-world dataset and evaluation schemes requiring both precise short-term planning and long-horizon ego-forecasting. Existing systems struggle to simultaneously meet both requirements. Indeed, we find that these tasks are fundamentally misaligned and should be addressed independently. We further assess the current state of closed-loop planning in the field, revealing the limitations of learning-based methods in complex real-world scenarios and the value of simple rule-based priors such as centerline selection through lane graph search algorithms. More surprisingly, for the open-loop sub-task, we observe that the best results are achieved when using only this centerline as scene context (i.e., ignoring all information regarding the map and other agents). Combining these insights, we propose an extremely simple and efficient planner which outperforms an extensive set of competitors, winning the nuPlan planning challenge 2023.
    Dyadic Reinforcement Learning. (arXiv:2308.07843v5 [cs.LG] UPDATED)
    Mobile health aims to enhance health outcomes by delivering interventions to individuals as they go about their daily life. The involvement of care partners and social support networks often proves crucial in helping individuals managing burdensome medical conditions. This presents opportunities in mobile health to design interventions that target the dyadic relationship -- the relationship between a target person and their care partner -- with the aim of enhancing social support. In this paper, we develop dyadic RL, an online reinforcement learning algorithm designed to personalize intervention delivery based on contextual factors and past responses of a target person and their care partner. Here, multiple sets of interventions impact the dyad across multiple time intervals. The developed dyadic RL is Bayesian and hierarchical. We formally introduce the problem setup, develop dyadic RL and establish a regret bound. We demonstrate dyadic RL's empirical performance through simulation studies on both toy scenarios and on a realistic test bed constructed from data collected in a mobile health study.
    Normalizing flows as approximations of optimal transport maps via linear-control neural ODEs. (arXiv:2311.01404v1 [math.OC])
    The term "Normalizing Flows" is related to the task of constructing invertible transport maps between probability measures by means of deep neural networks. In this paper, we consider the problem of recovering the $W_2$-optimal transport map $T$ between absolutely continuous measures $\mu,\nu\in\mathcal{P}(\mathbb{R}^n)$ as the flow of a linear-control neural ODE. We first show that, under suitable assumptions on $\mu,\nu$ and on the controlled vector fields, the optimal transport map is contained in the $C^0_c$-closure of the flows generated by the system. Assuming that discrete approximations $\mu_N,\nu_N$ of the original measures $\mu,\nu$ are available, we use a discrete optimal coupling $\gamma_N$ to define an optimal control problem. With a $\Gamma$-convergence argument, we prove that its solutions correspond to flows that approximate the optimal transport map $T$. Finally, taking advantage of the Pontryagin Maximum Principle, we propose an iterative numerical scheme for the resolution of the optimal control problem, resulting in an algorithm for the practical computation of the approximated optimal transport map.
    Tipping Points of Evolving Epidemiological Networks: Machine Learning-Assisted, Data-Driven Effective Modeling. (arXiv:2311.00797v1 [cs.LG])
    We study the tipping point collective dynamics of an adaptive susceptible-infected-susceptible (SIS) epidemiological network in a data-driven, machine learning-assisted manner. We identify a parameter-dependent effective stochastic differential equation (eSDE) in terms of physically meaningful coarse mean-field variables through a deep-learning ResNet architecture inspired by numerical stochastic integrators. We construct an approximate effective bifurcation diagram based on the identified drift term of the eSDE and contrast it with the mean-field SIS model bifurcation diagram. We observe a subcritical Hopf bifurcation in the evolving network's effective SIS dynamics, that causes the tipping point behavior; this takes the form of large amplitude collective oscillations that spontaneously -- yet rarely -- arise from the neighborhood of a (noisy) stationary state. We study the statistics of these rare events both through repeated brute force simulations and by using established mathematical/computational tools exploiting the right-hand-side of the identified SDE. We demonstrate that such a collective SDE can also be identified (and the rare events computations also performed) in terms of data-driven coarse observables, obtained here via manifold learning techniques, in particular Diffusion Maps. The workflow of our study is straightforwardly applicable to other complex dynamics problems exhibiting tipping point dynamics.
    Effective Human-AI Teams via Learned Natural Language Rules and Onboarding. (arXiv:2311.01007v1 [cs.LG])
    People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules grounded in data regions and described in natural language that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space that corrects the human prior. Each region is then described using an iterative and contrastive procedure where a large language model describes the region. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.
    Learning to See Physical Properties with Active Sensing Motor Policies. (arXiv:2311.01405v1 [cs.RO])
    Knowledge of terrain's physical properties inferred from color images can aid in making efficient robotic locomotion plans. However, unlike image classification, it is unintuitive for humans to label image patches with physical properties. Without labeled data, building a vision system that takes as input the observed terrain and predicts physical properties remains challenging. We present a method that overcomes this challenge by self-supervised labeling of images captured by robots during real-world traversal with physical property estimators trained in simulation. To ensure accurate labeling, we introduce Active Sensing Motor Policies (ASMP), which are trained to explore locomotion behaviors that increase the accuracy of estimating physical parameters. For instance, the quadruped robot learns to swipe its foot against the ground to estimate the friction coefficient accurately. We show that the visual system trained with a small amount of real-world traversal data accurately predicts physical parameters. The trained system is robust and works even with overhead images captured by a drone despite being trained on data collected by cameras attached to a quadruped robot walking on the ground.
    Hierarchical Proxy Modeling for Improved HPO in Time Series Forecasting. (arXiv:2211.15092v2 [cs.LG] UPDATED)
    Selecting the right set of hyperparameters is crucial in time series forecasting. The classical temporal cross-validation framework for hyperparameter optimization (HPO) often leads to poor test performance because of a possible mismatch between validation and test periods. To address this test-validation mismatch, we propose a novel technique, H-Pro to drive HPO via test proxies by exploiting data hierarchies often associated with time series datasets. Since higher-level aggregated time series often show less irregularity and better predictability as compared to the lowest-level time series which can be sparse and intermittent, we optimize the hyperparameters of the lowest-level base-forecaster by leveraging the proxy forecasts for the test period generated from the forecasters at higher levels. H-Pro can be applied on any off-the-shelf machine learning model to perform HPO. We validate the efficacy of our technique with extensive empirical evaluation on five publicly available hierarchical forecasting datasets. Our approach outperforms existing state-of-the-art methods in Tourism, Wiki, and Traffic datasets, and achieves competitive result in Tourism-L dataset, without any model-specific enhancements. Moreover, our method outperforms the winning method of the M5 forecast accuracy competition.
    CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. (arXiv:2303.17760v2 [cs.AI] UPDATED)
    The rapid advancement of chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the conversation, which can be challenging and time-consuming. This paper explores the potential of building scalable techniques to facilitate autonomous cooperation among communicative agents, and provides insight into their "cognitive" processes. To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-playing. Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models. In particular, we conduct comprehensive studies on instruction-following cooperation in multi-agent settings. Our contributions include introducing a novel communicative agent framework, offering a scalable approach for studying the cooperative behaviors and capabilities of multi-agent systems, and open-sourcing our library to support research on communicative agents and beyond: https://github.com/camel-ai/camel.
    JADE: A Linguistics-based Safety Evaluation Platform for LLM. (arXiv:2311.00286v2 [cs.CL] UPDATED)
    In this paper, we present JADE, a targeted linguistic fuzzing platform which strengthens the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely-used LLMs categorized in three groups: eight open-sourced Chinese, six commercial Chinese and four commercial English LLMs. JADE generates three safety benchmarks for the three groups of LLMs, which contain unsafe questions that are highly threatening: the questions simultaneously trigger harmful generation of multiple LLMs, with an average unsafe generation ratio of $70\%$ (please see the table below), while are still natural questions, fluent and preserving the core unsafe semantics. We release the benchmark demos generated for commercial English LLMs and open-sourced English LLMs in the following link: https://github.com/whitzard-ai/jade-db. For readers who are interested in evaluating on more questions generated by JADE, please contact us. JADE is based on Noam Chomsky's seminal theory of transformational-generative grammar. Given a seed question with unsafe intention, JADE invokes a sequence of generative and transformational rules to increment the complexity of the syntactic structure of the original question, until the safety guardrail is broken. Our key insight is: Due to the complexity of human language, most of the current best LLMs can hardly recognize the invariant evil from the infinite number of different syntactic structures which form an unbound example space that can never be fully covered. Technically, the generative/transformative rules are constructed by native speakers of the languages, and, once developed, can be used to automatically grow and transform the parse tree of a given question, until the guardrail is broken. For more evaluation results and demo, please check our website: https://whitzard-ai.github.io/jade.html.
    Investigating Relative Performance of Transfer and Meta Learning. (arXiv:2311.00727v1 [cs.LG])
    Over the past decade, the field of machine learning has experienced remarkable advancements. While image recognition systems have achieved impressive levels of accuracy, they continue to rely on extensive training datasets. Additionally, a significant challenge has emerged in the form of poor out-of-distribution performance, which necessitates retraining neural networks when they encounter conditions that deviate from their training data. This limitation has notably contributed to the slow progress in self-driving car technology. These pressing issues have sparked considerable interest in methods that enable neural networks to learn effectively from limited data. This paper presents the outcomes of an extensive investigation designed to compare two distinct approaches, transfer learning and meta learning, as potential solutions to this problem. The overarching objective was to establish a robust criterion for selecting the most suitable method in diverse machine learning scenarios. Building upon prior research, I expanded the comparative analysis by introducing a new meta learning method into the investigation. Subsequently, I assessed whether the findings remained consistent under varying conditions. Finally, I delved into the impact of altering the size of the training dataset on the relative performance of these methods. This comprehensive exploration has yielded insights into the conditions favoring each approach, thereby facilitating the development of a criterion for selecting the most appropriate method in any given situation
    Conformalized Deep Splines for Optimal and Efficient Prediction Sets. (arXiv:2311.00774v1 [cs.LG])
    Uncertainty estimation is critical in high-stakes machine learning applications. One effective way to estimate uncertainty is conformal prediction, which can provide predictive inference with statistical coverage guarantees. We present a new conformal regression method, Spline Prediction Intervals via Conformal Estimation (SPICE), that estimates the conditional density using neural-network-parameterized splines. We prove universal approximation and optimality results for SPICE, which are empirically validated by our experiments. SPICE is compatible with two different efficient-to-compute conformal scores, one oracle-optimal for marginal coverage (SPICE-ND) and the other asymptotically optimal for conditional coverage (SPICE-HPD). Results on benchmark datasets demonstrate SPICE-ND models achieve the smallest average prediction set sizes, including average size reductions of nearly 50% for some datasets compared to the next best baseline. SPICE-HPD models achieve the best conditional coverage compared to baselines. The SPICE implementation is made available.
    Neural Field Dynamics Model for Granular Object Piles Manipulation. (arXiv:2311.00802v1 [cs.RO])
    We present a learning-based dynamics model for granular material manipulation. Inspired by the Eulerian approach commonly used in fluid dynamics, our method adopts a fully convolutional neural network that operates on a density field-based representation of object piles and pushers, allowing it to exploit the spatial locality of inter-object interactions as well as the translation equivariance through convolution operations. Furthermore, our differentiable action rendering module makes the model fully differentiable and can be directly integrated with a gradient-based trajectory optimization algorithm. We evaluate our model with a wide array of piles manipulation tasks both in simulation and real-world experiments and demonstrate that it significantly exceeds existing latent or particle-based methods in both accuracy and computation efficiency, and exhibits zero-shot generalization capabilities across various environments and tasks.
    Exploring Unified Perspective For Fast Shapley Value Estimation. (arXiv:2311.01010v1 [cs.LG])
    Shapley values have emerged as a widely accepted and trustworthy tool, grounded in theoretical axioms, for addressing challenges posed by black-box models like deep neural networks. However, computing Shapley values encounters exponential complexity in the number of features. Various approaches, including ApproSemivalue, KernelSHAP, and FastSHAP, have been explored to expedite the computation. We analyze the consistency of existing works and conclude that stochastic estimators can be unified as the linear transformation of importance sampling of feature subsets. Based on this, we investigate the possibility of designing simple amortized estimators and propose a straightforward and efficient one, SimSHAP, by eliminating redundant techniques. Extensive experiments conducted on tabular and image datasets validate the effectiveness of our SimSHAP, which significantly accelerates the computation of accurate Shapley values.
    Distance-Based Propagation for Efficient Knowledge Graph Reasoning. (arXiv:2311.01024v1 [cs.LG])
    Knowledge graph completion (KGC) aims to predict unseen edges in knowledge graphs (KGs), resulting in the discovery of new facts. A new class of methods have been proposed to tackle this problem by aggregating path information. These methods have shown tremendous ability in the task of KGC. However they are plagued by efficiency issues. Though there are a few recent attempts to address this through learnable path pruning, they often sacrifice the performance to gain efficiency. In this work, we identify two intrinsic limitations of these methods that affect the efficiency and representation quality. To address the limitations, we introduce a new method, TAGNet, which is able to efficiently propagate information. This is achieved by only aggregating paths in a fixed window for each source-target pair. We demonstrate that the complexity of TAGNet is independent of the number of layers. Extensive experiments demonstrate that TAGNet can cut down on the number of propagated messages by as much as 90% while achieving competitive performance on multiple KG datasets. The code is available at https://github.com/HarryShomer/TAGNet.
    Accelerating Electronic Stopping Power Predictions by 10 Million Times with a Combination of Time-Dependent Density Functional Theory and Machine Learning. (arXiv:2311.00787v1 [cond-mat.mtrl-sci])
    Knowing the rate at which particle radiation releases energy in a material, the stopping power, is key to designing nuclear reactors, medical treatments, semiconductor and quantum materials, and many other technologies. While the nuclear contribution to stopping power, i.e., elastic scattering between atoms, is well understood in the literature, the route for gathering data on the electronic contribution has for decades remained costly and reliant on many simplifying assumptions, including that materials are isotropic. We establish a method that combines time-dependent density functional theory (TDDFT) and machine learning to reduce the time to assess new materials to mere hours on a supercomputer and provides valuable data on how atomic details influence electronic stopping. Our approach uses TDDFT to compute the electronic stopping contributions to stopping power from first principles in several directions and then machine learning to interpolate to other directions at rates 10 million times higher. We demonstrate the combined approach in a study of proton irradiation in aluminum and employ it to predict how the depth of maximum energy deposition, the "Bragg Peak," varies depending on incident angle -- a quantity otherwise inaccessible to modelers. The lack of any experimental information requirement makes our method applicable to most materials, and its speed makes it a prime candidate for enabling quantum-to-continuum models of radiation damage. The prospect of reusing valuable TDDFT data for training the model make our approach appealing for applications in the age of materials data science.
    Sorting with Predictions. (arXiv:2311.00749v1 [cs.DS])
    We explore the fundamental problem of sorting through the lens of learning-augmented algorithms, where algorithms can leverage possibly erroneous predictions to improve their efficiency. We consider two different settings: In the first setting, each item is provided a prediction of its position in the sorted list. In the second setting, we assume there is a "quick-and-dirty" way of comparing items, in addition to slow-and-exact comparisons. For both settings, we design new and simple algorithms using only $O(\sum_i \log \eta_i)$ exact comparisons, where $\eta_i$ is a suitably defined prediction error for the $i$th element. In particular, as the quality of predictions deteriorates, the number of comparisons degrades smoothly from $O(n)$ to $O(n\log n)$. We prove that the comparison complexity is theoretically optimal with respect to the examined error measures. An experimental evaluation against existing adaptive and non-adaptive sorting algorithms demonstrates the potential of applying learning-augmented algorithms in sorting tasks.
    Generalizing Importance Weighting to A Universal Solver for Distribution Shift Problems. (arXiv:2305.14690v2 [cs.LG] UPDATED)
    Distribution shift (DS) may have two levels: the distribution itself changes, and the support (i.e., the set where the probability density is non-zero) also changes. When considering the support change between the training and test distributions, there can be four cases: (i) they exactly match; (ii) the training support is wider (and thus covers the test support); (iii) the test support is wider; (iv) they partially overlap. Existing methods are good at cases (i) and (ii), while cases (iii) and (iv) are more common nowadays but still under-explored. In this paper, we generalize importance weighting (IW), a golden solver for cases (i) and (ii), to a universal solver for all cases. Specifically, we first investigate why IW might fail in cases (iii) and (iv); based on the findings, we propose generalized IW (GIW) that could handle cases (iii) and (iv) and would reduce to IW in cases (i) and (ii). In GIW, the test support is split into an in-training (IT) part and an out-of-training (OOT) part, and the expected risk is decomposed into a weighted classification term over the IT part and a standard classification term over the OOT part, which guarantees the risk consistency of GIW. Then, the implementation of GIW consists of three components: (a) the split of validation data is carried out by the one-class support vector machine, (b) the first term of the empirical risk can be handled by any IW algorithm given training data and IT validation data, and (c) the second term just involves OOT validation data. Experiments demonstrate that GIW is a universal solver for DS problems, outperforming IW methods in cases (iii) and (iv).
    Data-Driven Model Selections of Second-Order Particle Dynamics via Integrating Gaussian Processes with Low-Dimensional Interacting Structures. (arXiv:2311.00902v1 [stat.ML])
    In this paper, we focus on the data-driven discovery of a general second-order particle-based model that contains many state-of-the-art models for modeling the aggregation and collective behavior of interacting agents of similar size and body type. This model takes the form of a high-dimensional system of ordinary differential equations parameterized by two interaction kernels that appraise the alignment of positions and velocities. We propose a Gaussian Process-based approach to this problem, where the unknown model parameters are marginalized by using two independent Gaussian Process (GP) priors on latent interaction kernels constrained to dynamics and observational data. This results in a nonparametric model for interacting dynamical systems that accounts for uncertainty quantification. We also develop acceleration techniques to improve scalability. Moreover, we perform a theoretical analysis to interpret the methodology and investigate the conditions under which the kernels can be recovered. We demonstrate the effectiveness of the proposed approach on various prototype systems, including the selection of the order of the systems and the types of interactions. In particular, we present applications to modeling two real-world fish motion datasets that display flocking and milling patterns up to 248 dimensions. Despite the use of small data sets, the GP-based approach learns an effective representation of the nonlinear dynamics in these spaces and outperforms competitor methods.
    Contrastive Modules with Temporal Attention for Multi-Task Reinforcement Learning. (arXiv:2311.01075v1 [cs.LG])
    In the field of multi-task reinforcement learning, the modular principle, which involves specializing functionalities into different modules and combining them appropriately, has been widely adopted as a promising approach to prevent the negative transfer problem that performance degradation due to conflicts between tasks. However, most of the existing multi-task RL methods only combine shared modules at the task level, ignoring that there may be conflicts within the task. In addition, these methods do not take into account that without constraints, some modules may learn similar functions, resulting in restricting the model's expressiveness and generalization capability of modular methods. In this paper, we propose the Contrastive Modules with Temporal Attention(CMTA) method to address these limitations. CMTA constrains the modules to be different from each other by contrastive learning and combining shared modules at a finer granularity than the task level with temporal attention, alleviating the negative transfer within the task and improving the generalization ability and the performance for multi-task RL. We conducted the experiment on Meta-World, a multi-task RL benchmark containing various robotics manipulation tasks. Experimental results show that CMTA outperforms learning each task individually for the first time and achieves substantial performance improvements over the baselines.
    Contrastive Moments: Unsupervised Halfspace Learning in Polynomial Time. (arXiv:2311.01435v1 [cs.LG])
    We give a polynomial-time algorithm for learning high-dimensional halfspaces with margins in $d$-dimensional space to within desired TV distance when the ambient distribution is an unknown affine transformation of the $d$-fold product of an (unknown) symmetric one-dimensional logconcave distribution, and the halfspace is introduced by deleting at least an $\epsilon$ fraction of the data in one of the component distributions. Notably, our algorithm does not need labels and establishes the unique (and efficient) identifiability of the hidden halfspace under this distributional assumption. The sample and time complexity of the algorithm are polynomial in the dimension and $1/\epsilon$. The algorithm uses only the first two moments of suitable re-weightings of the empirical distribution, which we call contrastive moments; its analysis uses classical facts about generalized Dirichlet polynomials and relies crucially on a new monotonicity property of the moment ratio of truncations of logconcave distributions. Such algorithms, based only on first and second moments were suggested in earlier work, but hitherto eluded rigorous guarantees. Prior work addressed the special case when the underlying distribution is Gaussian via Non-Gaussian Component Analysis. We improve on this by providing polytime guarantees based on Total Variation (TV) distance, in place of existing moment-bound guarantees that can be super-polynomial. Our work is also the first to go beyond Gaussians in this setting.
    Unraveling Fundamental Properties of Power System Resilience Curves using Unsupervised Machine Learning. (arXiv:2310.10030v2 [cs.LG] UPDATED)
    The standard model of infrastructure resilience, the resilience triangle, has been the primary way of characterizing and quantifying infrastructure resilience. However, the theoretical model merely provides a one-size-fits-all framework for all infrastructure systems. Most of the existing studies examine the characteristics of infrastructure resilience curves based on analytical models constructed upon simulated system performance. Limited empirical studies hindered our ability to fully understand and predict resilience characteristics in infrastructure systems. To address this gap, this study examined over 200 resilience curves related to power outages in three major extreme weather events. Using unsupervised machine learning, we examined different curve archetypes, as well as the fundamental properties of each resilience curve archetype. The results show two primary archetypes for power system resilience curves, triangular, and trapezoidal curves. Triangular curves characterize resilience behavior based on 1. critical functionality threshold, 2. critical functionality recovery rate, and 3. recovery pivot point. Trapezoidal archetypes explain resilience curves based on 1. duration of sustained function loss and 2. constant recovery rate. The longer the duration of sustained function loss, the slower the constant rate of recovery. The findings of this study provide novel perspectives enabling better understanding and prediction of resilience performance of power system infrastructures.
    Push it to the Demonstrated Limit: Multimodal Visuotactile Imitation Learning with Force Matching. (arXiv:2311.01248v1 [cs.RO])
    Optical tactile sensors have emerged as an effective means to acquire dense contact information during robotic manipulation. A recently-introduced `see-through-your-skin' (STS) variant of this type of sensor has both visual and tactile modes, enabled by leveraging a semi-transparent surface and controllable lighting. In this work, we investigate the benefits of pairing visuotactile sensing with imitation learning for contact-rich manipulation tasks. First, we use tactile force measurements and a novel algorithm during kinesthetic teaching to yield a force profile that better matches that of the human demonstrator. Second, we add visual/tactile STS mode switching as a control policy output, simplifying the application of the sensor. Finally, we study multiple observation configurations to compare and contrast the value of visual/tactile data (both with and without mode switching) with visual data from a wrist-mounted eye-in-hand camera. We perform an extensive series of experiments on a real robotic manipulator with door-opening and closing tasks, including over 3,000 real test episodes. Our results highlight the importance of tactile sensing for imitation learning, both for data collection to allow force matching, and for policy execution to allow accurate task feedback.
    Calibrated Explanations: with Uncertainty Information and Counterfactuals. (arXiv:2305.02305v2 [cs.AI] UPDATED)
    While local explanations for AI models can offer insights into individual predictions, such as feature importance, they are plagued by issues like instability. The unreliability of feature weights, often skewed due to poorly calibrated ML models, deepens these challenges. Moreover, the critical aspect of feature importance uncertainty remains mostly unaddressed in Explainable AI (XAI). The novel feature importance explanation method presented in this paper, called Calibrated Explanations (CE), is designed to tackle these issues head-on. Built on the foundation of Venn-Abers, CE not only calibrates the underlying model but also delivers reliable feature importance explanations with an exact definition of the feature weights. CE goes beyond conventional solutions by addressing output uncertainty. It accomplishes this by providing uncertainty quantification for both feature weights and the model's probability estimates. Additionally, CE is model-agnostic, featuring easily comprehensible conditional rules and the ability to generate counterfactual explanations with embedded uncertainty quantification. Results from an evaluation with 25 benchmark datasets underscore the efficacy of CE, making it stand as a fast, reliable, stable, and robust solution.
    Enhancing Clustering Representations with Positive Proximity and Cluster Dispersion Learning. (arXiv:2311.00731v1 [cs.LG])
    Contemporary deep clustering approaches often rely on either contrastive or non-contrastive techniques to acquire effective representations for clustering tasks. Contrastive methods leverage negative pairs to achieve homogenous representations but can introduce class collision issues, potentially compromising clustering performance. On the contrary, non-contrastive techniques prevent class collisions but may produce non-uniform representations that lead to clustering collapse. In this work, we propose a novel end-to-end deep clustering approach named PIPCDR, designed to harness the strengths of both approaches while mitigating their limitations. PIPCDR incorporates a positive instance proximity loss and a cluster dispersion regularizer. The positive instance proximity loss ensures alignment between augmented views of instances and their sampled neighbors, enhancing within-cluster compactness by selecting genuinely positive pairs within the embedding space. Meanwhile, the cluster dispersion regularizer maximizes inter-cluster distances while minimizing within-cluster compactness, promoting uniformity in the learned representations. PIPCDR excels in producing well-separated clusters, generating uniform representations, avoiding class collision issues, and enhancing within-cluster compactness. We extensively validate the effectiveness of PIPCDR within an end-to-end Majorize-Minimization framework, demonstrating its competitive performance on moderate-scale clustering benchmark datasets and establishing new state-of-the-art results on large-scale datasets.
    Analysis of Information Propagation in Ethereum Network Using Combined Graph Attention Network and Reinforcement Learning to Optimize Network Efficiency and Scalability. (arXiv:2311.01406v1 [cs.LG])
    Blockchain technology has revolutionized the way information is propagated in decentralized networks. Ethereum plays a pivotal role in facilitating smart contracts and decentralized applications. Understanding information propagation dynamics in Ethereum is crucial for ensuring network efficiency, security, and scalability. In this study, we propose an innovative approach that utilizes Graph Convolutional Networks (GCNs) to analyze the information propagation patterns in the Ethereum network. The first phase of our research involves data collection from the Ethereum blockchain, consisting of blocks, transactions, and node degrees. We construct a transaction graph representation using adjacency matrices to capture the node embeddings; while our major contribution is to develop a combined Graph Attention Network (GAT) and Reinforcement Learning (RL) model to optimize the network efficiency and scalability. It learns the best actions to take in various network states, ultimately leading to improved network efficiency, throughput, and optimize gas limits for block processing. In the experimental evaluation, we analyze the performance of our model on a large-scale Ethereum dataset. We investigate effectively aggregating information from neighboring nodes capturing graph structure and updating node embeddings using GCN with the objective of transaction pattern prediction, accounting for varying network loads and number of blocks. Not only we design a gas limit optimization model and provide the algorithm, but also to address scalability, we demonstrate the use and implementation of sparse matrices in GraphConv, GraphSAGE, and GAT. The results indicate that our designed GAT-RL model achieves superior results compared to other GCN models in terms of performance. It effectively propagates information across the network, optimizing gas limits for block processing and improving network efficiency.
    Meaning Representations from Trajectories in Autoregressive Models. (arXiv:2310.18348v2 [cs.CL] UPDATED)
    We propose to extract meaning representations from autoregressive language models by considering the distribution of all possible trajectories extending an input text. This strategy is prompt-free, does not require fine-tuning, and is applicable to any pre-trained autoregressive model. Moreover, unlike vector-based representations, distribution-based representations can also model asymmetric relations (e.g., direction of logical entailment, hypernym/hyponym relations) by using algebraic operations between likelihood functions. These ideas are grounded in distributional perspectives on semantics and are connected to standard constructions in automata theory, but to our knowledge they have not been applied to modern language models. We empirically show that the representations obtained from large models align well with human annotations, outperform other zero-shot and prompt-free methods on semantic similarity tasks, and can be used to solve more complex entailment and containment tasks that standard embeddings cannot handle. Finally, we extend our method to represent data from different modalities (e.g., image and text) using multimodal autoregressive models.
    AI for Interpretable Chemistry: Predicting Radical Mechanistic Pathways via Contrastive Learning. (arXiv:2311.01118v1 [cs.LG])
    Deep learning-based reaction predictors have undergone significant architectural evolution. However, their reliance on reactions from the US Patent Office results in a lack of interpretable predictions and limited generalization capability to other chemistry domains, such as radical and atmospheric chemistry. To address these challenges, we introduce a new reaction predictor system, RMechRP, that leverages contrastive learning in conjunction with mechanistic pathways, the most interpretable representation of chemical reactions. Specifically designed for radical reactions, RMechRP provides different levels of interpretation of chemical reactions. We develop and train multiple deep-learning models using RMechDB, a public database of radical reactions, to establish the first benchmark for predicting radical reactions. Our results demonstrate the effectiveness of RMechRP in providing accurate and interpretable predictions of radical reactions, and its potential for various applications in atmospheric chemistry.
    FlashDecoding++: Faster Large Language Model Inference on GPUs. (arXiv:2311.01282v1 [cs.LG])
    As the Large Language Model (LLM) becomes increasingly important in various domains. However, the following challenges still remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation requires a synchronized update operation among each partial softmax result, leading to ~20% overheads for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The shape of matrices performing GEMM in LLM inference is flat, leading to under-utilized computation and >50% performance loss after padding zeros in previous designs. (3) Performance loss due to static dataflow. Kernel performance in LLM depends on varied input data features, hardware configurations, etc. A single and static dataflow may lead to a 50.25% performance loss for GEMMs of different shapes in LLM inference. We present FlashDecoding++, a fast LLM inference engine supporting mainstream LLMs and hardware back-ends. To tackle the above challenges, FlashDecoding++ creatively proposes: (1) Asynchronized softmax with unified max value. FlashDecoding++ introduces a unified max value technique for different partial softmax computations to avoid synchronization. (2) Flat GEMM optimization with double buffering. FlashDecoding++ points out that flat GEMMs with different shapes face varied bottlenecks. Then, techniques like double buffering are introduced. (3) Heuristic dataflow with hardware resource adaptation. FlashDecoding++ heuristically optimizes dataflow using different hardware resource considering input dynamics. Due to the versatility of optimizations in FlashDecoding++, FlashDecoding++ can achieve up to 4.86x and 2.18x speedup on both NVIDIA and AMD GPUs compared to Hugging Face implementations. FlashDecoding++ also achieves an average speedup of 1.37x compared to state-of-the-art LLM inference engines on mainstream LLMs.
    Sanitized Clustering against Confounding Bias. (arXiv:2311.01252v1 [cs.LG])
    Real-world datasets inevitably contain biases that arise from different sources or conditions during data collection. Consequently, such inconsistency itself acts as a confounding factor that disturbs the cluster analysis. Existing methods eliminate the biases by projecting data onto the orthogonal complement of the subspace expanded by the confounding factor before clustering. Therein, the interested clustering factor and the confounding factor are coarsely considered in the raw feature space, where the correlation between the data and the confounding factor is ideally assumed to be linear for convenient solutions. These approaches are thus limited in scope as the data in real applications is usually complex and non-linearly correlated with the confounding factor. This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB), which removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure. To be specific, we eliminate the bias information in the latent space by minimizing the mutual information between the confounding factor and the latent representation delivered by Variational Auto-Encoder (VAE). Meanwhile, a clustering module is introduced to cluster over the purified latent representations. Extensive experiments on complex datasets demonstrate that our SCAB achieves a significant gain in clustering performance by removing the confounding bias. The code is available at \url{https://github.com/EvaFlower/SCAB}.
    Boosting Adversarial Transferability by Achieving Flat Local Maxima. (arXiv:2306.05225v2 [cs.CV] UPDATED)
    Transfer-based attack adopts the adversarial examples generated on the surrogate model to attack various models, making it applicable in the physical world and attracting increasing interest. Recently, various adversarial attacks have emerged to boost adversarial transferability from different perspectives. In this work, inspired by the observation that flat local minima are correlated with good generalization, we assume and empirically validate that adversarial examples at a flat local region tend to have good transferability by introducing a penalized gradient norm to the original loss function. Since directly optimizing the gradient regularization norm is computationally expensive and intractable for generating adversarial examples, we propose an approximation optimization method to simplify the gradient update of the objective function. Specifically, we randomly sample an example and adopt a first-order procedure to approximate the curvature of Hessian/vector product, which makes computing more efficient by interpolating two neighboring gradients. Meanwhile, in order to obtain a more stable gradient direction, we randomly sample multiple examples and average the gradients of these examples to reduce the variance due to random sampling during the iterative process. Extensive experimental results on the ImageNet-compatible dataset show that the proposed method can generate adversarial examples at flat local regions, and significantly improve the adversarial transferability on either normally trained models or adversarially trained models than the state-of-the-art attacks. Our codes are available at: https://github.com/Trustworthy-AI-Group/PGN.
    The Re-Label Method For Data-Centric Machine Learning. (arXiv:2302.04391v6 [cs.LG] UPDATED)
    In industry deep learning application, our manually labeled data has a certain number of noisy data. To solve this problem and achieve more than 90 score in dev dataset, we present a simple method to find the noisy data and re-label the noisy data by human, given the model predictions as references in human labeling. In this paper, we illustrate our idea for a broad set of deep learning tasks, includes classification, sequence tagging, object detection, sequence generation, click-through rate prediction. The dev dataset evaluation results and human evaluation results verify our idea.
    Re-weighting Tokens: A Simple and Effective Active Learning Strategy for Named Entity Recognition. (arXiv:2311.00906v1 [cs.CL])
    Active learning, a widely adopted technique for enhancing machine learning models in text and image classification tasks with limited annotation resources, has received relatively little attention in the domain of Named Entity Recognition (NER). The challenge of data imbalance in NER has hindered the effectiveness of active learning, as sequence labellers lack sufficient learning signals. To address these challenges, this paper presents a novel reweighting-based active learning strategy that assigns dynamic smoothed weights to individual tokens. This adaptable strategy is compatible with various token-level acquisition functions and contributes to the development of robust active learners. Experimental results on multiple corpora demonstrate the substantial performance improvement achieved by incorporating our re-weighting strategy into existing acquisition functions, validating its practical efficacy.
    Real-Time Magnetic Tracking and Diagnosis of COVID-19 via Machine Learning. (arXiv:2311.00737v1 [cs.LG])
    The COVID-19 pandemic underscored the importance of reliable, noninvasive diagnostic tools for robust public health interventions. In this work, we fused magnetic respiratory sensing technology (MRST) with machine learning (ML) to create a diagnostic platform for real-time tracking and diagnosis of COVID-19 and other respiratory diseases. The MRST precisely captures breathing patterns through three specific breath testing protocols: normal breath, holding breath, and deep breath. We collected breath data from both COVID-19 patients and healthy subjects in Vietnam using this platform, which then served to train and validate ML models. Our evaluation encompassed multiple ML algorithms, including support vector machines and deep learning models, assessing their ability to diagnose COVID-19. Our multi-model validation methodology ensures a thorough comparison and grants the adaptability to select the most optimal model, striking a balance between diagnostic precision with model interpretability. The findings highlight the exceptional potential of our diagnostic tool in pinpointing respiratory anomalies, achieving over 90% accuracy. This innovative sensor technology can be seamlessly integrated into healthcare settings for patient monitoring, marking a significant enhancement for the healthcare infrastructure.
    Beyond Ensemble Averages: Leveraging Climate Model Ensembles for Subseasonal Forecasting. (arXiv:2211.15856v2 [cs.LG] UPDATED)
    Producing high-quality forecasts of key climate variables such as temperature and precipitation on subseasonal time scales has long been a gap in operational forecasting. Recent studies have shown promising results using machine learning (ML) models to advance subseasonal forecasting (SSF), but several open questions remain. First, several past approaches use the average of an ensemble of physics-based forecasts as an input feature of these models. However, ensemble forecasts contain information that can aid prediction beyond only the ensemble mean. Second, past methods have focused on average performance, whereas forecasts of extreme events are far more important for planning and mitigation purposes. Third, climate forecasts correspond to a spatially-varying collection of forecasts, and different methods account for spatial variability in the response differently. Trade-offs between different approaches may be mitigated with model stacking. This paper describes the application of a variety of ML methods used to predict monthly average precipitation and two meter temperature using physics-based predictions (ensemble forecasts) and observational data such as relative humidity, pressure at sea level, or geopotential height, two weeks in advance for the whole continental United States. Regression, quantile regression, and tercile classification tasks using linear models, random forests, convolutional neural networks, and stacked models are considered. The proposed models outperform common baselines such as historical averages (or quantiles) and ensemble averages (or quantiles). This paper further includes an investigation of feature importance, trade-offs between using the full ensemble or only the ensemble average, and different modes of accounting for spatial variability.
    DP-Mix: Mixup-based Data Augmentation for Differentially Private Learning. (arXiv:2311.01295v1 [cs.LG])
    Data augmentation techniques, such as simple image transformations and combinations, are highly effective at improving the generalization of computer vision models, especially when training data is limited. However, such techniques are fundamentally incompatible with differentially private learning approaches, due to the latter's built-in assumption that each training image's contribution to the learned model is bounded. In this paper, we investigate why naive applications of multi-sample data augmentation techniques, such as mixup, fail to achieve good performance and propose two novel data augmentation techniques specifically designed for the constraints of differentially private learning. Our first technique, DP-Mix_Self, achieves SoTA classification performance across a range of datasets and settings by performing mixup on self-augmented data. Our second technique, DP-Mix_Diff, further improves performance by incorporating synthetic data from a pre-trained diffusion model into the mixup process. We open-source the code at https://github.com/wenxuan-Bao/DP-Mix.
    SatBird: Bird Species Distribution Modeling with Remote Sensing and Citizen Science Data. (arXiv:2311.00936v1 [cs.LG])
    Biodiversity is declining at an unprecedented rate, impacting ecosystem services necessary to ensure food, water, and human health and well-being. Understanding the distribution of species and their habitats is crucial for conservation policy planning. However, traditional methods in ecology for species distribution models (SDMs) generally focus either on narrow sets of species or narrow geographical areas and there remain significant knowledge gaps about the distribution of species. A major reason for this is the limited availability of data traditionally used, due to the prohibitive amount of effort and expertise required for traditional field monitoring. The wide availability of remote sensing data and the growing adoption of citizen science tools to collect species observations data at low cost offer an opportunity for improving biodiversity monitoring and enabling the modelling of complex ecosystems. We introduce a novel task for mapping bird species to their habitats by predicting species encounter rates from satellite images, and present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird, considering summer (breeding) and winter seasons. We also provide a dataset in Kenya representing low-data regimes. We additionally provide environmental data and species range maps for each location. We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks. SatBird opens up possibilities for scalably modelling properties of ecosystems worldwide.
    Gradient-free online learning of subgrid-scale dynamics with neural emulators. (arXiv:2310.19385v2 [physics.comp-ph] UPDATED)
    In this paper, we propose a generic algorithm to train machine learning-based subgrid parametrizations online, i.e., with $\textit{a posteriori}$ loss functions for non-differentiable numerical solvers. The proposed approach leverage neural emulators to train an approximation of the reduced state-space solver, which is then used to allows gradient propagation through temporal integration steps. The algorithm is able to recover most of the benefit of online strategies without having to compute the gradient of the original solver. It is demonstrated that training the neural emulator and parametrization components separately with respective loss quantities is necessary in order to minimize the propagation of some approximation bias.
    Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation. (arXiv:2307.02598v2 [cs.LG] UPDATED)
    We tackle the problems of latent variables identification and ``out-of-support'' image generation in representation learning. We show that both are possible for a class of decoders that we call additive, which are reminiscent of decoders used for object-centric representation learning (OCRL) and well suited for images that can be decomposed as a sum of object-specific images. We provide conditions under which exactly solving the reconstruction problem using an additive decoder is guaranteed to identify the blocks of latent variables up to permutation and block-wise invertible transformations. This guarantee relies only on very weak assumptions about the distribution of the latent factors, which might present statistical dependencies and have an almost arbitrarily shaped support. Our result provides a new setting where nonlinear independent component analysis (ICA) is possible and adds to our theoretical understanding of OCRL methods. We also show theoretically that additive decoders can generate novel images by recombining observed factors of variations in novel ways, an ability we refer to as Cartesian-product extrapolation. We show empirically that additivity is crucial for both identifiability and extrapolation on simulated data.
    Adapt On-the-Go: Behavior Modulation for Single-Life Robot Deployment. (arXiv:2311.01059v1 [cs.RO])
    To succeed in the real world, robots must cope with situations that differ from those seen during training. We study the problem of adapting on-the-fly to such novel scenarios during deployment, by drawing upon a diverse repertoire of previously learned behaviors. Our approach, RObust Autonomous Modulation (ROAM), introduces a mechanism based on the perceived value of pre-trained behaviors to select and adapt pre-trained behaviors to the situation at hand. Crucially, this adaptation process all happens within a single episode at test time, without any human supervision. We provide theoretical analysis of our selection mechanism and demonstrate that ROAM enables a robot to adapt rapidly to changes in dynamics both in simulation and on a real Go1 quadruped, even successfully moving forward with roller skates on its feet. Our approach adapts over 2x as efficiently compared to existing methods when facing a variety of out-of-distribution situations during deployment by effectively choosing and adapting relevant behaviors on-the-fly.
    Transparent Anomaly Detection via Concept-based Explanations. (arXiv:2310.10702v2 [cs.LG] UPDATED)
    Advancements in deep learning techniques have given a boost to the performance of anomaly detection. However, real-world and safety-critical applications demand a level of transparency and reasoning beyond accuracy. The task of anomaly detection (AD) focuses on finding whether a given sample follows the learned distribution. Existing methods lack the ability to reason with clear explanations for their outcomes. Hence to overcome this challenge, we propose Transparent {A}nomaly Detection {C}oncept {E}xplanations (ACE). ACE is able to provide human interpretable explanations in the form of concepts along with anomaly prediction. To the best of our knowledge, this is the first paper that proposes interpretable by-design anomaly detection. In addition to promoting transparency in AD, it allows for effective human-model interaction. Our proposed model shows either higher or comparable results to black-box uninterpretable models. We validate the performance of ACE across three realistic datasets - bird classification on CUB-200-2011, challenging histopathology slide image classification on TIL-WSI-TCGA, and gender classification on CelebA. We further demonstrate that our concept learning paradigm can be seamlessly integrated with other classification-based AD methods.
    On Finding Bi-objective Pareto-optimal Fraud Prevention Rule Sets for Fintech Applications. (arXiv:2311.00964v1 [cs.LG])
    Rules are widely used in Fintech institutions to make fraud prevention decisions, since rules are highly interpretable thanks to their intuitive if-then structure. In practice, a two-stage framework of fraud prevention decision rule set mining is usually employed in large Fintech institutions. This paper is concerned with finding high-quality rule subsets in a bi-objective space (such as precision and recall) from an initial pool of rules. To this end, we adopt the concept of Pareto optimality and aim to find a set of non-dominated rule subsets, which constitutes a Pareto front. We propose a heuristic-based framework called PORS and we identify that the core of PORS is the problem of solution selection on the front (SSF). We provide a systematic categorization of the SSF problem and a thorough empirical evaluation of various SSF methods on both public and proprietary datasets. We also introduce a novel variant of sequential covering algorithm called SpectralRules to encourage the diversity of the initial rule set and we empirically find that SpectralRules further improves the quality of the found Pareto front. On two real application scenarios within Alipay, we demonstrate the advantages of our proposed methodology compared to existing work.
    Low-latency Real-time Voice Conversion on CPU. (arXiv:2311.00873v1 [cs.SD])
    We adapt the architectures of previous audio manipulation and generation neural networks to the task of real-time any-to-one voice conversion. Our resulting model, LLVC ($\textbf{L}$ow-latency $\textbf{L}$ow-resource $\textbf{V}$oice $\textbf{C}$onversion), has a latency of under 20ms at a bitrate of 16kHz and runs nearly 2.8x faster than real-time on a consumer CPU. LLVC uses both a generative adversarial architecture as well as knowledge distillation in order to attain this performance. To our knowledge LLVC achieves both the lowest resource usage as well as the lowest latency of any open-source voice conversion model. We provide open-source samples, code, and pretrained model weights at https://github.com/KoeAI/LLVC.
    Identifying Alzheimer Disease Dementia Levels Using Machine Learning Methods. (arXiv:2311.01428v1 [cs.LG])
    Dementia, a prevalent neurodegenerative condition, is a major manifestation of Alzheimer's disease (AD). As the condition progresses from mild to severe, it significantly impairs the individual's ability to perform daily tasks independently, necessitating the need for timely and accurate AD classification. Machine learning or deep learning models have emerged as effective tools for this purpose. In this study, we suggested an approach for classifying the four stages of dementia using RF, SVM, and CNN algorithms, augmented with watershed segmentation for feature extraction from MRI images. Our results reveal that SVM with watershed features achieves an impressive accuracy of 96.25%, surpassing other classification methods. The ADNI dataset is utilized to evaluate the effectiveness of our method, and we observed that the inclusion of watershed segmentation contributes to the enhanced performance of the models.
    Textually Pretrained Speech Language Models. (arXiv:2305.13009v2 [cs.CL] UPDATED)
    Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language models. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. We make speech samples, code and models publicly available: https://pages.cs.huji.ac.il/adiyoss-lab/twist/ .
    Deep learning based Image Compression for Microscopy Images: An Empirical Study. (arXiv:2311.01352v1 [eess.IV])
    With the fast development of modern microscopes and bioimaging techniques, an unprecedentedly large amount of imaging data are being generated, stored, analyzed, and even shared through networks. The size of the data poses great challenges for current data infrastructure. One common way to reduce the data size is by image compression. This present study analyzes classic and deep learning based image compression methods, and their impact on deep learning based image processing models. Deep learning based label-free prediction models (i.e., predicting fluorescent images from bright field images) are used as an example application for comparison and analysis. Effective image compression methods could help reduce the data size significantly without losing necessary information, and therefore reduce the burden on data management infrastructure and permit fast transmission through the network for data sharing or cloud computing. To compress images in such a wanted way, multiple classical lossy image compression techniques are compared to several AI-based compression models provided by and trained with the CompressAI toolbox using python. These different compression techniques are compared in compression ratio, multiple image similarity measures and, most importantly, the prediction accuracy from label-free models on compressed images. We found that AI-based compression techniques largely outperform the classic ones and will minimally affect the downstream label-free task in 2D cases. In the end, we hope the present study could shed light on the potential of deep learning based image compression and the impact of image compression on downstream deep learning based image analysis models.
    Exploration noise for learning linear-quadratic mean field games. (arXiv:2107.00839v2 [math.OC] UPDATED)
    The goal of this paper is to demonstrate that common noise may serve as an exploration noise for learning the solution of a mean field game. This concept is here exemplified through a toy linear-quadratic model, for which a suitable form of common noise has already been proven to restore existence and uniqueness. We here go one step further and prove that the same form of common noise may force the convergence of the learning algorithm called `fictitious play', and this without any further potential or monotone structure. Several numerical examples are provided in order to support our theoretical analysis.
    Unreading Race: Purging Protected Features from Chest X-ray Embeddings. (arXiv:2311.01349v1 [cs.LG])
    Purpose: To analyze and remove protected feature effects in chest radiograph embeddings of deep learning models. Materials and Methods: An orthogonalization is utilized to remove the influence of protected features (e.g., age, sex, race) in chest radiograph embeddings, ensuring feature-independent results. To validate the efficacy of the approach, we retrospectively study the MIMIC and CheXpert datasets using three pre-trained models, namely a supervised contrastive, a self-supervised contrastive, and a baseline classifier model. Our statistical analysis involves comparing the original versus the orthogonalized embeddings by estimating protected feature influences and evaluating the ability to predict race, age, or sex using the two types of embeddings. Results: Our experiments reveal a significant influence of protected features on predictions of pathologies. Applying orthogonalization removes these feature effects. Apart from removing any influence on pathology classification, while maintaining competitive predictive performance, orthogonalized embeddings further make it infeasible to directly predict protected attributes and mitigate subgroup disparities. Conclusion: The presented work demonstrates the successful application and evaluation of the orthogonalization technique in the domain of chest X-ray classification.
    Fraud Analytics Using Machine-learning & Engineering on Big Data (FAME) for Telecom. (arXiv:2311.00724v1 [cs.LG])
    Telecom industries lose globally 46.3 Billion USD due to fraud. Data mining and machine learning techniques (apart from rules oriented approach) have been used in past, but efficiency has been low as fraud pattern changes very rapidly. This paper presents an industrialized solution approach with self adaptive data mining technique and application of big data technologies to detect fraud and discover novel fraud patterns in accurate, efficient and cost effective manner. Solution has been successfully demonstrated to detect International Revenue Share Fraud with <5% false positive. More than 1 Terra Bytes of Call Detail Record from a reputed wholesale carrier and overseas telecom transit carrier has been used to conduct this study.
    Dynamic Fair Federated Learning Based on Reinforcement Learning. (arXiv:2311.00959v1 [cs.LG])
    Federated learning enables a collaborative training and optimization of global models among a group of devices without sharing local data samples. However, the heterogeneity of data in federated learning can lead to unfair representation of the global model across different devices. To address the fairness issue in federated learning, we propose a dynamic q fairness federated learning algorithm with reinforcement learning, called DQFFL. DQFFL aims to mitigate the discrepancies in device aggregation and enhance the fairness of treatment for all groups involved in federated learning. To quantify fairness, DQFFL leverages the performance of the global federated model on each device and incorporates {\alpha}-fairness to transform the preservation of fairness during federated aggregation into the distribution of client weights in the aggregation process. Considering the sensitivity of parameters in measuring fairness, we propose to utilize reinforcement learning for dynamic parameters during aggregation. Experimental results demonstrate that our DQFFL outperforms the state-of-the-art methods in terms of overall performance, fairness and convergence speed.
    Towards Evaluating Transfer-based Attacks Systematically, Practically, and Fairly. (arXiv:2311.01323v1 [cs.LG])
    The adversarial vulnerability of deep neural networks (DNNs) has drawn great attention due to the security risk of applying these models in real-world applications. Based on transferability of adversarial examples, an increasing number of transfer-based methods have been developed to fool black-box DNN models whose architecture and parameters are inaccessible. Although tremendous effort has been exerted, there still lacks a standardized benchmark that could be taken advantage of to compare these methods systematically, fairly, and practically. Our investigation shows that the evaluation of some methods needs to be more reasonable and more thorough to verify their effectiveness, to avoid, for example, unfair comparison and insufficient consideration of possible substitute/victim models. Therefore, we establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods. In this paper, we evaluate and compare them comprehensively on 25 popular substitute/victim models on ImageNet. New insights about the effectiveness of these methods are gained and guidelines for future evaluations are provided. Code at: https://github.com/qizhangli/TA-Bench.
    Improving Adversarial Transferability via Intermediate-level Perturbation Decay. (arXiv:2304.13410v3 [cs.LG] UPDATED)
    Intermediate-level attacks that attempt to perturb feature representations following an adversarial direction drastically have shown favorable performance in crafting transferable adversarial examples. Existing methods in this category are normally formulated with two separate stages, where a directional guide is required to be determined at first and the scalar projection of the intermediate-level perturbation onto the directional guide is enlarged thereafter. The obtained perturbation deviates from the guide inevitably in the feature space, and it is revealed in this paper that such a deviation may lead to sub-optimal attack. To address this issue, we develop a novel intermediate-level method that crafts adversarial examples within a single stage of optimization. In particular, the proposed method, named intermediate-level perturbation decay (ILPD), encourages the intermediate-level perturbation to be in an effective adversarial direction and to possess a great magnitude simultaneously. In-depth discussion verifies the effectiveness of our method. Experimental results show that it outperforms state-of-the-arts by large margins in attacking various victim models on ImageNet (+10.07% on average) and CIFAR-10 (+3.88% on average). Our code is at https://github.com/qizhangli/ILPD-attack.
    Holistic Transfer: Towards Non-Disruptive Fine-Tuning with Partial Target Data. (arXiv:2311.01420v1 [cs.LG])
    We propose a learning problem involving adapting a pre-trained source model to the target domain for classifying all classes that appeared in the source data, using target data that covers only a partial label space. This problem is practical, as it is unrealistic for the target end-users to collect data for all classes prior to adaptation. However, it has received limited attention in the literature. To shed light on this issue, we construct benchmark datasets and conduct extensive experiments to uncover the inherent challenges. We found a dilemma -- on the one hand, adapting to the new target domain is important to claim better performance; on the other hand, we observe that preserving the classification accuracy of classes missing in the target adaptation data is highly challenging, let alone improving them. To tackle this, we identify two key directions: 1) disentangling domain gradients from classification gradients, and 2) preserving class relationships. We present several effective solutions that maintain the accuracy of the missing classes and enhance the overall performance, establishing solid baselines for holistic transfer of pre-trained models with partial target data.
    Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis. (arXiv:2311.01052v1 [stat.ML])
    We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.
    Scattering Vision Transformer: Spectral Mixing Matters. (arXiv:2311.01310v1 [cs.CV])
    Vision transformers have gained significant attention and achieved state-of-the-art performance in various computer vision tasks, including image classification, instance segmentation, and object detection. However, challenges remain in addressing attention complexity and effectively capturing fine-grained information within images. Existing solutions often resort to down-sampling operations, such as pooling, to reduce computational cost. Unfortunately, such operations are non-invertible and can result in information loss. In this paper, we present a novel approach called Scattering Vision Transformer (SVT) to tackle these challenges. SVT incorporates a spectrally scattering network that enables the capture of intricate image details. SVT overcomes the invertibility issue associated with down-sampling operations by separating low-frequency and high-frequency components. Furthermore, SVT introduces a unique spectral gating network utilizing Einstein multiplication for token and channel mixing, effectively reducing complexity. We show that SVT achieves state-of-the-art performance on the ImageNet dataset with a significant reduction in a number of parameters and FLOPS. SVT shows 2\% improvement over LiTv2 and iFormer. SVT-H-S reaches 84.2\% top-1 accuracy, while SVT-H-B reaches 85.2\% (state-of-art for base versions) and SVT-H-L reaches 85.7\% (again state-of-art for large versions). SVT also shows comparable results in other vision tasks such as instance segmentation. SVT also outperforms other transformers in transfer learning on standard datasets such as CIFAR10, CIFAR100, Oxford Flower, and Stanford Car datasets. The project page is available on this webpage.\url{https://badripatro.github.io/svt/}.
    Deep Double Descent for Time Series Forecasting: Avoiding Undertrained Models. (arXiv:2311.01442v1 [cs.LG])
    Deep learning models, particularly Transformers, have achieved impressive results in various domains, including time series forecasting. While existing time series literature primarily focuses on model architecture modifications and data augmentation techniques, this paper explores the training schema of deep learning models for time series; how models are trained regardless of their architecture. We perform extensive experiments to investigate the occurrence of deep double descent in several Transformer models trained on public time series data sets. We demonstrate epoch-wise deep double descent and that overfitting can be reverted using more epochs. Leveraging these findings, we achieve state-of-the-art results for long sequence time series forecasting in nearly 70% of the 72 benchmarks tested. This suggests that many models in the literature may possess untapped potential. Additionally, we introduce a taxonomy for classifying training schema modifications, covering data augmentation, model inputs, model targets, time series per model, and computational budget.
    SensorSCAN: Self-Supervised Learning and Deep Clustering for Fault Diagnosis in Chemical Processes. (arXiv:2208.08879v2 [cs.LG] UPDATED)
    Modern industrial facilities generate large volumes of raw sensor data during the production process. This data is used to monitor and control the processes and can be analyzed to detect and predict process abnormalities. Typically, the data has to be annotated by experts in order to be used in predictive modeling. However, manual annotation of large amounts of data can be difficult in industrial settings. In this paper, we propose SensorSCAN, a novel method for unsupervised fault detection and diagnosis, designed for industrial chemical process monitoring. We demonstrate our model's performance on two publicly available datasets of the Tennessee Eastman Process with various faults. The results show that our method significantly outperforms existing approaches (+0.2-0.3 TPR for a fixed FPR) and effectively detects most of the process faults without expert annotation. Moreover, we show that the model fine-tuned on a small fraction of labeled data nearly reaches the performance of a SOTA model trained on the full dataset. We also demonstrate that our method is suitable for real-world applications where the number of faults is not known in advance. The code is available at https://github.com/AIRI-Institute/sensorscan.
    Atlas-Based Interpretable Age Prediction In Whole-Body MR Images. (arXiv:2307.07439v3 [eess.IV] UPDATED)
    Age prediction is an important part of medical assessments and research. It can aid in detecting diseases as well as abnormal ageing by highlighting the discrepancy between chronological and biological age. To gain a comprehensive understanding of age-related changes observed in various body parts, we investigate them on a larger scale by using whole-body 3D images. We utilise the Grad-CAM interpretability method to determine the body areas most predictive of a person's age. We expand our analysis beyond individual subjects by employing registration techniques to generate population-wide interpretability maps. Our findings reveal three primary areas of interest: the spine, the autochthonous back muscles, and the cardiac region, which exhibits the highest importance.
    Offline Imitation from Observation via Primal Wasserstein State Occupancy Matching. (arXiv:2311.01331v1 [cs.LG])
    In real-world scenarios, arbitrary interactions with the environment can often be costly, and actions of expert demonstrations are not always available. To reduce the need for both, Offline Learning from Observations (LfO) is extensively studied, where the agent learns to solve a task with only expert states and \textit{task-agnostic} non-expert state-action pairs. The state-of-the-art DIstribution Correction Estimation (DICE) methods minimize the state occupancy divergence between the learner and expert policies. However, they are limited to either $f$-divergences (KL and $\chi^2$) or Wasserstein distance with Rubinstein duality, the latter of which constrains the underlying distance metric crucial to the performance of Wasserstein-based solutions. To address this problem, we propose Primal Wasserstein DICE (PW-DICE), which minimizes the primal Wasserstein distance between the expert and learner state occupancies with a pessimistic regularizer and leverages a contrastively learned distance as the underlying metric for the Wasserstein distance. Theoretically, we prove that our framework is a generalization of the state-of-the-art, SMODICE, and unifies $f$-divergence and Wasserstein minimization. Empirically, we find that PW-DICE improves upon several state-of-the-art methods on multiple testbeds.
    BiSLS/SPS: Auto-tune Step Sizes for Stable Bi-level Optimization. (arXiv:2305.18666v2 [cs.LG] UPDATED)
    The popularity of bi-level optimization (BO) in deep learning has spurred a growing interest in studying gradient-based BO algorithms. However, existing algorithms involve two coupled learning rates that can be affected by approximation errors when computing hypergradients, making careful fine-tuning necessary to ensure fast convergence. To alleviate this issue, we investigate the use of recently proposed adaptive step-size methods, namely stochastic line search (SLS) and stochastic Polyak step size (SPS), for computing both the upper and lower-level learning rates. First, we revisit the use of SLS and SPS in single-level optimization without the additional interpolation condition that is typically assumed in prior works. For such settings, we investigate new variants of SLS and SPS that improve upon existing suggestions in the literature and are simpler to implement. Importantly, these two variants can be seen as special instances of general family of methods with an envelope-type step-size. This unified envelope strategy allows for the extension of the algorithms and their convergence guarantees to BO settings. Finally, our extensive experiments demonstrate that the new algorithms, which are available in both SGD and Adam versions, can find large learning rates with minimal tuning and converge faster than corresponding vanilla SGD or Adam BO algorithms that require fine-tuning.
    Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion. (arXiv:2311.01017v1 [cs.CV])
    Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two reasons as major bottlenecks: dealing with complex and unstructured observation space, and having a scalable generative model. Consequently, we propose a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer into the discrete diffusion framework with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, our model reduces prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotic agents.
    AeroPath: An airway segmentation benchmark dataset with challenging pathology. (arXiv:2311.01138v1 [cs.CV])
    To improve the prognosis of patients suffering from pulmonary diseases, such as lung cancer, early diagnosis and treatment are crucial. The analysis of CT images is invaluable for diagnosis, whereas high quality segmentation of the airway tree are required for intervention planning and live guidance during bronchoscopy. Recently, the Multi-domain Airway Tree Modeling (ATM'22) challenge released a large dataset, both enabling training of deep-learning based models and bringing substantial improvement of the state-of-the-art for the airway segmentation task. However, the ATM'22 dataset includes few patients with severe pathologies affecting the airway tree anatomy. In this study, we introduce a new public benchmark dataset (AeroPath), consisting of 27 CT images from patients with pathologies ranging from emphysema to large tumors, with corresponding trachea and bronchi annotations. Second, we present a multiscale fusion design for automatic airway segmentation. Models were trained on the ATM'22 dataset, tested on the AeroPath dataset, and further evaluated against competitive open-source methods. The same performance metrics as used in the ATM'22 challenge were used to benchmark the different considered approaches. Lastly, an open web application is developed, to easily test the proposed model on new data. The results demonstrated that our proposed architecture predicted topologically correct segmentations for all the patients included in the AeroPath dataset. The proposed method is robust and able to handle various anomalies, down to at least the fifth airway generation. In addition, the AeroPath dataset, featuring patients with challenging pathologies, will contribute to development of new state-of-the-art methods. The AeroPath dataset and the web application are made openly available.
    In Defense of Softmax Parametrization for Calibrated and Consistent Learning to Defer. (arXiv:2311.01106v1 [cs.LG])
    Enabling machine learning classifiers to defer their decision to a downstream expert when the expert is more accurate will ensure improved safety and performance. This objective can be achieved with the learning-to-defer framework which aims to jointly learn how to classify and how to defer to the expert. In recent studies, it has been theoretically shown that popular estimators for learning to defer parameterized with softmax provide unbounded estimates for the likelihood of deferring which makes them uncalibrated. However, it remains unknown whether this is due to the widely used softmax parameterization and if we can find a softmax-based estimator that is both statistically consistent and possesses a valid probability estimator. In this work, we first show that the cause of the miscalibrated and unbounded estimator in prior literature is due to the symmetric nature of the surrogate losses used and not due to softmax. We then propose a novel statistically consistent asymmetric softmax-based surrogate loss that can produce valid estimates without the issue of unboundedness. We further analyze the non-asymptotic properties of our method and empirically validate its performance and calibration on benchmark datasets.
    Sequence Modeling with Multiresolution Convolutional Memory. (arXiv:2305.01638v2 [cs.LG] UPDATED)
    Efficiently capturing the long-range patterns in sequential data sources salient to a given task -- such as classification and generative modeling -- poses a fundamental challenge. Popular approaches in the space tradeoff between the memory burden of brute-force enumeration and comparison, as in transformers, the computational burden of complicated sequential dependencies, as in recurrent neural networks, or the parameter burden of convolutional networks with many or large filters. We instead take inspiration from wavelet-based multiresolution analysis to define a new building block for sequence modeling, which we call a MultiresLayer. The key component of our model is the multiresolution convolution, capturing multiscale trends in the input sequence. Our MultiresConv can be implemented with shared filters across a dilated causal convolution tree. Thus it garners the computational advantages of convolutional networks and the principled theoretical motivation of wavelet decompositions. Our MultiresLayer is straightforward to implement, requires significantly fewer parameters, and maintains at most a $\mathcal{O}(N\log N)$ memory footprint for a length $N$ sequence. Yet, by stacking such layers, our model yields state-of-the-art performance on a number of sequence classification and autoregressive density estimation tasks using CIFAR-10, ListOps, and PTB-XL datasets.
    Learning Realistic Traffic Agents in Closed-loop. (arXiv:2311.01394v1 [cs.RO])
    Realistic traffic simulation is crucial for developing self-driving software in a safe and scalable manner prior to real-world deployment. Typically, imitation learning (IL) is used to learn human-like traffic agents directly from real-world observations collected offline, but without explicit specification of traffic rules, agents trained from IL alone frequently display unrealistic infractions like collisions and driving off the road. This problem is exacerbated in out-of-distribution and long-tail scenarios. On the other hand, reinforcement learning (RL) can train traffic agents to avoid infractions, but using RL alone results in unhuman-like driving behaviors. We propose Reinforcing Traffic Rules (RTR), a holistic closed-loop learning objective to match expert demonstrations under a traffic compliance constraint, which naturally gives rise to a joint IL + RL approach, obtaining the best of both worlds. Our method learns in closed-loop simulations of both nominal scenarios from real-world datasets as well as procedurally generated long-tail scenarios. Our experiments show that RTR learns more realistic and generalizable traffic simulation policies, achieving significantly better tradeoffs between human-like driving and traffic compliance in both nominal and long-tail scenarios. Moreover, when used as a data generation tool for training prediction models, our learned traffic policy leads to considerably improved downstream prediction metrics compared to baseline traffic agents. For more information, visit the project website: https://waabi.ai/rtr
    Empathy Detection Using Machine Learning on Text, Audiovisual, Audio or Physiological Signals. (arXiv:2311.00721v1 [cs.HC])
    Empathy is a social skill that indicates an individual's ability to understand others. Over the past few years, empathy has drawn attention from various disciplines, including but not limited to Affective Computing, Cognitive Science and Psychology. Empathy is a context-dependent term; thus, detecting or recognising empathy has potential applications in society, healthcare and education. Despite being a broad and overlapping topic, the avenue of empathy detection studies leveraging Machine Learning remains underexplored from a holistic literature perspective. To this end, we systematically collect and screen 801 papers from 10 well-known databases and analyse the selected 54 papers. We group the papers based on input modalities of empathy detection systems, i.e., text, audiovisual, audio and physiological signals. We examine modality-specific pre-processing and network architecture design protocols, popular dataset descriptions and availability details, and evaluation protocols. We further discuss the potential applications, deployment challenges and research gaps in the Affective Computing-based empathy domain, which can facilitate new avenues of exploration. We believe that our work is a stepping stone to developing a privacy-preserving and unbiased empathic system inclusive of culture, diversity and multilingualism that can be deployed in practice to enhance the overall well-being of human life.
    A Deep Learning algorithm to accelerate Algebraic Multigrid methods in Finite Element solvers of 3D elliptic PDEs. (arXiv:2304.10832v3 [math.NA] UPDATED)
    Algebraic multigrid (AMG) methods are among the most efficient solvers for linear systems of equations and they are widely used for the solution of problems stemming from the discretization of Partial Differential Equations (PDEs). The most severe limitation of AMG methods is the dependence on parameters that require to be fine-tuned. In particular, the strong threshold parameter is the most relevant since it stands at the basis of the construction of successively coarser grids needed by the AMG methods. We introduce a novel Deep Learning algorithm that minimizes the computational cost of the AMG method when used as a finite element solver. We show that our algorithm requires minimal changes to any existing code. The proposed Artificial Neural Network (ANN) tunes the value of the strong threshold parameter by interpreting the sparse matrix of the linear system as a black-and-white image and exploiting a pooling operator to transform it into a small multi-channel image. We experimentally prove that the pooling successfully reduces the computational cost of processing a large sparse matrix and preserves the features needed for the regression task at hand. We train the proposed algorithm on a large dataset containing problems with a highly heterogeneous diffusion coefficient defined in different three-dimensional geometries and discretized with unstructured grids and linear elasticity problems with a highly heterogeneous Young's modulus. When tested on problems with coefficients or geometries not present in the training dataset, our approach reduces the computational time by up to 30%.
    Diable: Efficient Dialogue State Tracking as Operations on Tables. (arXiv:2305.17020v3 [cs.CL] UPDATED)
    Sequence-to-sequence state-of-the-art systems for dialogue state tracking (DST) use the full dialogue history as input, represent the current state as a list with all the slots, and generate the entire state from scratch at each dialogue turn. This approach is inefficient, especially when the number of slots is large and the conversation is long. We propose Diable, a new task formalisation that simplifies the design and implementation of efficient DST systems and allows one to easily plug and play large language models. We represent the dialogue state as a table and formalise DST as a table manipulation task. At each turn, the system updates the previous state by generating table operations based on the dialogue context. Extensive experimentation on the MultiWoz datasets demonstrates that Diable (i) outperforms strong efficient DST baselines, (ii) is 2.4x more time efficient than current state-of-the-art methods while retaining competitive Joint Goal Accuracy, and (iii) is robust to noisy data annotations due to the table operations approach.
    Can Large Language Models Design Accurate Label Functions?. (arXiv:2311.00739v1 [cs.CL])
    Programmatic weak supervision methodologies facilitate the expedited labeling of extensive datasets through the use of label functions (LFs) that encapsulate heuristic data sources. Nonetheless, the creation of precise LFs necessitates domain expertise and substantial endeavors. Recent advances in pre-trained language models (PLMs) have exhibited substantial potential across diverse tasks. However, the capacity of PLMs to autonomously formulate accurate LFs remains an underexplored domain. In this research, we address this gap by introducing DataSculpt, an interactive framework that harnesses PLMs for the automated generation of LFs. Within DataSculpt, we incorporate an array of prompting techniques, instance selection strategies, and LF filtration methods to explore the expansive design landscape. Ultimately, we conduct a thorough assessment of DataSculpt's performance on 12 real-world datasets, encompassing a range of tasks. This evaluation unveils both the strengths and limitations of contemporary PLMs in LF design.
    Gaussian Processes on Cellular Complexes. (arXiv:2311.01198v1 [cs.LG])
    In recent years, there has been considerable interest in developing machine learning models on graphs in order to account for topological inductive biases. In particular, recent attention was given to Gaussian processes on such structures since they can additionally account for uncertainty. However, graphs are limited to modelling relations between two vertices. In this paper, we go beyond this dyadic setting and consider polyadic relations that include interactions between vertices, edges and one of their generalisations, known as cells. Specifically, we propose Gaussian processes on cellular complexes, a generalisation of graphs that captures interactions between these higher-order cells. One of our key contributions is the derivation of two novel kernels, one that generalises the graph Mat\'ern kernel and one that additionally mixes information of different cell types.
    Representation Equivalent Neural Operators: a Framework for Alias-free Operator Learning. (arXiv:2305.19913v2 [cs.LG] UPDATED)
    Recently, operator learning, or learning mappings between infinite-dimensional function spaces, has garnered significant attention, notably in relation to learning partial differential equations from data. Conceptually clear when outlined on paper, neural operators necessitate discretization in the transition to computer implementations. This step can compromise their integrity, often causing them to deviate from the underlying operators. This research offers a fresh take on neural operators with a framework Representation equivalent Neural Operators (ReNO) designed to address these issues. At its core is the concept of operator aliasing, which measures inconsistency between neural operators and their discrete representations. We explore this for widely-used operator learning techniques. Our findings detail how aliasing introduces errors when handling different discretizations and grids and loss of crucial continuous structures. More generally, this framework not only sheds light on existing challenges but, given its constructive and broad nature, also potentially offers tools for developing new neural operators.
    Gaussian Process Priors for Systems of Linear Partial Differential Equations with Constant Coefficients. (arXiv:2212.14319v4 [stat.ML] UPDATED)
    Partial differential equations (PDEs) are important tools to model physical systems and including them into machine learning models is an important way of incorporating physical knowledge. Given any system of linear PDEs with constant coefficients, we propose a family of Gaussian process (GP) priors, which we call EPGP, such that all realizations are exact solutions of this system. We apply the Ehrenpreis-Palamodov fundamental principle, which works as a non-linear Fourier transform, to construct GP kernels mirroring standard spectral methods for GPs. Our approach can infer probable solutions of linear PDE systems from any data such as noisy measurements, or pointwise defined initial and boundary conditions. Constructing EPGP-priors is algorithmic, generally applicable, and comes with a sparse version (S-EPGP) that learns the relevant spectral frequencies and works better for big data sets. We demonstrate our approach on three families of systems of PDEs, the heat equation, wave equation, and Maxwell's equations, where we improve upon the state of the art in computation time and precision, in some experiments by several orders of magnitude.
    tmn at #SMM4H 2023: Comparing Text Preprocessing Techniques for Detecting Tweets Self-reporting a COVID-19 Diagnosis. (arXiv:2311.00732v1 [cs.CL])
    The paper describes a system developed for Task 1 at SMM4H 2023. The goal of the task is to automatically distinguish tweets that self-report a COVID-19 diagnosis (for example, a positive test, clinical diagnosis, or hospitalization) from those that do not. We investigate the use of different techniques for preprocessing tweets using four transformer-based models. The ensemble of fine-tuned language models obtained an F1-score of 84.5%, which is 4.1% higher than the average value.
    GIST: Generated Inputs Sets Transferability in Deep Learning. (arXiv:2311.00801v1 [cs.LG])
    As the demand for verifiability and testability of neural networks continues to rise, an increasing number of methods for generating test sets are being developed. However, each of these techniques tends to emphasize specific testing aspects and can be quite time-consuming. A straightforward solution to mitigate this issue is to transfer test sets between some benchmarked models and a new model under test, based on a desirable property one wishes to transfer. This paper introduces GIST (Generated Inputs Sets Transferability), a novel approach for the efficient transfer of test sets among Deep Learning models. Given a property of interest that a user wishes to transfer (e.g., coverage criterion), GIST enables the selection of good test sets from the point of view of this property among available ones from a benchmark. We empirically evaluate GIST on fault types coverage property with two modalities and different test set generation procedures to demonstrate the approach's feasibility. Experimental results show that GIST can select an effective test set for the given property to transfer it to the model under test. Our results suggest that GIST could be applied to transfer other properties and could generalize to different test sets' generation procedures and modalities
    Time-series Generation by Contrastive Imitation. (arXiv:2311.01388v1 [stat.ML])
    Consider learning a generative model for time-series data. The sequential setting poses a unique challenge: Not only should the generator capture the conditional dynamics of (stepwise) transitions, but its open-loop rollouts should also preserve the joint distribution of (multi-step) trajectories. On one hand, autoregressive models trained by MLE allow learning and computing explicit transition distributions, but suffer from compounding error during rollouts. On the other hand, adversarial models based on GAN training alleviate such exposure bias, but transitions are implicit and hard to assess. In this work, we study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy, where the reinforcement signal is provided by a global (but stepwise-decomposable) energy model trained by contrastive estimation. At training, the two components are learned cooperatively, avoiding the instabilities typical of adversarial objectives. At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality. By expressly training a policy to imitate sequential behavior of time-series features in a dataset, this approach embodies "generation by imitation". Theoretically, we illustrate the correctness of this formulation and the consistency of the algorithm. Empirically, we evaluate its ability to generate predictively useful samples from real-world datasets, verifying that it performs at the standard of existing benchmarks.
    Mahalanobis-Aware Training for Out-of-Distribution Detection. (arXiv:2311.00808v1 [cs.LG])
    While deep learning models have seen widespread success in controlled environments, there are still barriers to their adoption in open-world settings. One critical task for safe deployment is the detection of anomalous or out-of-distribution samples that may require human intervention. In this work, we present a novel loss function and recipe for training networks with improved density-based out-of-distribution sensitivity. We demonstrate the effectiveness of our method on CIFAR-10, notably reducing the false-positive rate of the relative Mahalanobis distance method on far-OOD tasks by over 50%.
    Electronic excited states from physically-constrained machine learning. (arXiv:2311.00844v1 [physics.chem-ph])
    Data-driven techniques are increasingly used to replace electronic-structure calculations of matter. In this context, a relevant question is whether machine learning (ML) should be applied directly to predict the desired properties or be combined explicitly with physically-grounded operations. We present an example of an integrated modeling approach, in which a symmetry-adapted ML model of an effective Hamiltonian is trained to reproduce electronic excitations from a quantum-mechanical calculation. The resulting model can make predictions for molecules that are much larger and more complex than those that it is trained on, and allows for dramatic computational savings by indirectly targeting the outputs of well-converged calculations while using a parameterization corresponding to a minimal atom-centered basis. These results emphasize the merits of intertwining data-driven techniques with physical approximations, improving the transferability and interpretability of ML models without affecting their accuracy and computational efficiency, and providing a blueprint for developing ML-augmented electronic-structure methods.
    Learning Collective Behaviors from Observation. (arXiv:2311.00875v1 [cs.LG])
    We present a review of a series of learning methods used to identify the structure of dynamical systems, aiming to understand emergent behaviors in complex systems of interacting agents. These methods not only offer theoretical guarantees of convergence but also demonstrate computational efficiency in handling high-dimensional observational data. They can manage observation data from both first- and second-order dynamical systems, accounting for observation/stochastic noise, complex interaction rules, missing interaction features, and real-world observations of interacting agent systems. The essence of developing such a series of learning methods lies in designing appropriate loss functions using the variational inverse problem approach, which inherently provides dimension reduction capabilities to our learning methods.
    Long-Range Neural Atom Learning for Molecular Graphs. (arXiv:2311.01276v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely adopted for drug discovery with molecular graphs. Nevertheless, current GNNs are mainly good at leveraging short-range interactions (SRI) but struggle to capture long-range interactions (LRI), both of which are crucial for determining molecular properties. To tackle this issue, we propose a method that implicitly projects all original atoms into a few Neural Atoms, which abstracts the collective information of atomic groups within a molecule. Specifically, we explicitly exchange the information among neural atoms and project them back to the atoms' representations as an enhancement. With this mechanism, neural atoms establish the communication channels among distant nodes, effectively reducing the interaction scope of arbitrary node pairs into a single hop. To provide an inspection of our method from a physical perspective, we reveal its connection with the traditional LRI calculation method, Ewald Summation. We conduct extensive experiments on three long-range graph benchmarks, covering both graph-level and link-level tasks on molecular graphs. We empirically justify that our method can be equipped with an arbitrary GNN and help to capture LRI.
    SmoothHess: ReLU Network Feature Interactions via Stein's Lemma. (arXiv:2311.00858v1 [cs.LG])
    Several recent methods for interpretability model feature interactions by looking at the Hessian of a neural network. This poses a challenge for ReLU networks, which are piecewise-linear and thus have a zero Hessian almost everywhere. We propose SmoothHess, a method of estimating second-order interactions through Stein's Lemma. In particular, we estimate the Hessian of the network convolved with a Gaussian through an efficient sampling algorithm, requiring only network gradient calls. SmoothHess is applied post-hoc, requires no modifications to the ReLU network architecture, and the extent of smoothing can be controlled explicitly. We provide a non-asymptotic bound on the sample complexity of our estimation procedure. We validate the superior ability of SmoothHess to capture interactions on benchmark datasets and a real-world medical spirometry dataset.
    Optimizing Inventory Routing: A Decision-Focused Learning Approach using Neural Networks. (arXiv:2311.00983v1 [cs.LG])
    Inventory Routing Problem (IRP) is a crucial challenge in supply chain management as it involves optimizing efficient route selection while considering the uncertainty of inventory demand planning. To solve IRPs, usually a two-stage approach is employed, where demand is predicted using machine learning techniques first, and then an optimization algorithm is used to minimize routing costs. Our experiment shows machine learning models fall short of achieving perfect accuracy because inventory levels are influenced by the dynamic business environment, which, in turn, affects the optimization problem in the next stage, resulting in sub-optimal decisions. In this paper, we formulate and propose a decision-focused learning-based approach to solving real-world IRPs. This approach directly integrates inventory prediction and routing optimization within an end-to-end system potentially ensuring a robust supply chain strategy.
    Tailoring Mixup to Data using Kernel Warping functions. (arXiv:2311.01434v1 [cs.LG])
    Data augmentation is an essential building block for learning efficient deep learning models. Among all augmentation techniques proposed so far, linear interpolation of training data points, also called mixup, has found to be effective for a large panel of applications. While the majority of works have focused on selecting the right points to mix, or applying complex non-linear interpolation, we are interested in mixing similar points more frequently and strongly than less similar ones. To this end, we propose to dynamically change the underlying distribution of interpolation coefficients through warping functions, depending on the similarity between data points to combine. We define an efficient and flexible framework to do so without losing in diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves both performance and calibration of models. Code available in https://github.com/ENSTA-U2IS/torch-uncertainty
    Fitted Value Iteration Methods for Bicausal Optimal Transport. (arXiv:2306.12658v2 [stat.ML] UPDATED)
    We develop a fitted value iteration (FVI) method to compute bicausal optimal transport (OT) where couplings have an adapted structure. Based on the dynamic programming formulation, FVI adopts a function class to approximate the value functions in bicausal OT. Under the concentrability condition and approximate completeness assumption, we prove the sample complexity using (local) Rademacher complexity. Furthermore, we demonstrate that multilayer neural networks with appropriate structures satisfy the crucial assumptions required in sample complexity proofs. Numerical experiments reveal that FVI outperforms linear programming and adapted Sinkhorn methods in scalability as the time horizon increases, while still maintaining acceptable accuracy.
    A Finite-Particle Convergence Rate for Stein Variational Gradient Descent. (arXiv:2211.09721v5 [cs.LG] UPDATED)
    We provide the first finite-particle convergence rate for Stein variational gradient descent (SVGD), a popular algorithm for approximating a probability distribution with a collection of particles. Specifically, whenever the target distribution is sub-Gaussian with a Lipschitz score, SVGD with n particles and an appropriate step size sequence drives the kernel Stein discrepancy to zero at an order 1/sqrt(log log n) rate. We suspect that the dependence on n can be improved, and we hope that our explicit, non-asymptotic proof strategy will serve as a template for future refinements.
    Towards Safe Propofol Dosing during General Anesthesia Using Deep Offline Reinforcement Learning. (arXiv:2303.10180v2 [cs.LG] UPDATED)
    Automated anesthesia promises to enable more precise and personalized anesthetic administration and free anesthesiologists from repetitive tasks, allowing them to focus on the most critical aspects of a patient's surgical care. Current research has typically focused on creating simulated environments from which agents can learn. These approaches have demonstrated good experimental results, but are still far from clinical application. In this paper, Policy Constraint Q-Learning (PCQL), a data-driven reinforcement learning algorithm for solving the problem of learning anesthesia strategies on real clinical datasets, is proposed. Conservative Q-Learning was first introduced to alleviate the problem of Q function overestimation in an offline context. A policy constraint term is added to agent training to keep the policy distribution of the agent and the anesthesiologist consistent to ensure safer decisions made by the agent in anesthesia scenarios. The effectiveness of PCQL was validated by extensive experiments on a real clinical anesthesia dataset. Experimental results show that PCQL is predicted to achieve higher gains than the baseline approach while maintaining good agreement with the reference dose given by the anesthesiologist, using less total dose, and being more responsive to the patient's vital signs. In addition, the confidence intervals of the agent were investigated, which were able to cover most of the clinical decisions of the anesthesiologist. Finally, an interpretable method, SHAP, was used to analyze the contributing components of the model predictions to increase the transparency of the model.
    VIGraph: Self-supervised Learning for Class-Imbalanced Node Classification. (arXiv:2311.01191v1 [cs.LG])
    Class imbalance in graph data poses significant challenges for node classification. Existing methods, represented by SMOTE-based approaches, partially alleviate this issue but still exhibit limitations during imbalanced scenario construction. Self-supervised learning (SSL) offers a promising solution by synthesizing minority nodes from the data itself, yet its potential remains unexplored. In this paper, we analyze the limitations of SMOTE-based approaches and introduce VIGraph, a novel SSL model based on the self-supervised Variational Graph Auto-Encoder (VGAE) that leverages Variational Inference (VI) to generate minority nodes. Specifically, VIGraph strictly adheres to the concept of imbalance when constructing imbalanced graphs and utilizes the generative VGAE to generate minority nodes. Moreover, VIGraph introduces a novel Siamese contrastive strategy at the decoding phase to improve the overall quality of generated nodes. VIGraph can generate high-quality nodes without reintegrating them into the original graph, eliminating the "Generating, Reintegrating, and Retraining" process found in SMOTE-based methods. Experiments on multiple real-world datasets demonstrate that VIGraph achieves promising results for class-imbalanced node classification tasks.
    Entropy-based Discovery of Summary Causal Graphs in Time Series. (arXiv:2105.10381v2 [cs.AI] UPDATED)
    This study addresses the problem of learning a summary causal graph on time series with potentially different sampling rates. To do so, we first propose a new causal temporal mutual information measure for time series. We then show how this measure relates to an entropy reduction principle that can be seen as a special case of the probability raising principle. We finally combine these two ingredients in PC-like and FCI-like algorithms to construct the summary causal graph. There algorithm are evaluated on several datasets, which shows both their efficacy and efficiency.
    On the Lipschitz constant of random neural networks. (arXiv:2311.01356v1 [stat.ML])
    Empirical studies have widely demonstrated that neural networks are highly sensitive to small, adversarial perturbations of the input. The worst-case robustness against these so-called adversarial examples can be quantified by the Lipschitz constant of the neural network. However, only few theoretical results regarding this quantity exist in the literature. In this paper, we initiate the study of the Lipschitz constant of random ReLU neural networks, i.e., neural networks whose weights are chosen at random and which employ the ReLU activation function. For shallow neural networks, we characterize the Lipschitz constant up to an absolute numerical constant. Moreover, we extend our analysis to deep neural networks of sufficiently large width where we prove upper and lower bounds for the Lipschitz constant. These bounds match up to a logarithmic factor that depends on the depth.
    Learning Defect Prediction from Unrealistic Data. (arXiv:2311.00931v1 [cs.LG])
    Pretrained models of code, such as CodeBERT and CodeT5, have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data, which are rarely available for downstream tasks. Instead, it has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs. Models trained on such data, however, tend to only perform well on similar data, while underperforming on real world programs. In this paper, we conjecture that this discrepancy stems from the presence of distracting samples that steer the model away from the real-world task distribution. To investigate this conjecture, we propose an approach for identifying the subsets of these large yet unrealistic datasets that are most similar to examples in real-world datasets based on their learned representations. Our approach extracts high-dimensional embeddings of both real-world and artificial programs using a neural model and scores artificial samples based on their distance to the nearest real-world sample. We show that training on only the nearest, representationally most similar samples while discarding samples that are not at all similar in representations yields consistent improvements across two popular pretrained models of code on two code understanding tasks. Our results are promising, in that they show that training models on a representative subset of an unrealistic dataset can help us harness the power of large-scale synthetic data generation while preserving downstream task performance. Finally, we highlight the limitations of applying AI models for predicting vulnerabilities and bugs in real-world applications
    Language Model Training Paradigms for Clinical Feature Embeddings. (arXiv:2311.00768v1 [cs.LG])
    In research areas with scarce data, representation learning plays a significant role. This work aims to enhance representation learning for clinical time series by deriving universal embeddings for clinical features, such as heart rate and blood pressure. We use self-supervised training paradigms for language models to learn high-quality clinical feature embeddings, achieving a finer granularity than existing time-step and patient-level representation learning. We visualize the learnt embeddings via unsupervised dimension reduction techniques and observe a high degree of consistency with prior clinical knowledge. We also evaluate the model performance on the MIMIC-III benchmark and demonstrate the effectiveness of using clinical feature embeddings. We publish our code online for replication.
    Generating QM1B with PySCF$_{\text{IPU}}$. (arXiv:2311.01135v1 [cs.LG])
    The emergence of foundation models in Computer Vision and Natural Language Processing have resulted in immense progress on downstream tasks. This progress was enabled by datasets with billions of training examples. Similar benefits are yet to be unlocked for quantum chemistry, where the potential of deep learning is constrained by comparatively small datasets with 100k to 20M training examples. These datasets are limited in size because the labels are computed using the accurate (but computationally demanding) predictions of Density Functional Theory (DFT). Notably, prior DFT datasets were created using CPU supercomputers without leveraging hardware acceleration. In this paper, we take a first step towards utilising hardware accelerators by introducing the data generator PySCF$_{\text{IPU}}$ using Intelligence Processing Units (IPUs). This allowed us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We demonstrate that a simple baseline neural network (SchNet 9M) improves its performance by simply increasing the amount of training data without additional inductive biases. To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low-resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets. Code and dataset are available on Github: this http URL
    Optimal Cost Constrained Adversarial Attacks For Multiple Agent Systems. (arXiv:2311.00859v1 [cs.LG])
    Finding optimal adversarial attack strategies is an important topic in reinforcement learning and the Markov decision process. Previous studies usually assume one all-knowing coordinator (attacker) for whom attacking different recipient (victim) agents incurs uniform costs. However, in reality, instead of using one limitless central attacker, the attacks often need to be performed by distributed attack agents. We formulate the problem of performing optimal adversarial agent-to-agent attacks using distributed attack agents, in which we impose distinct cost constraints on each different attacker-victim pair. We propose an optimal method integrating within-step static constrained attack-resource allocation optimization and between-step dynamic programming to achieve the optimal adversarial attack in a multi-agent system. Our numerical results show that the proposed attacks can significantly reduce the rewards received by the attacked agents.
    Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game. (arXiv:2311.01011v1 [cs.LG])
    While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third party prompts that subvert the intent of the system designer. To help researchers study this problem, we present a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection, all created by players of an online game called Tensor Trust. To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs. The attacks in our dataset have a lot of easily interpretable stucture, and shed light on the weaknesses of LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking. Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset. Furthermore, we show that some attack strategies from the dataset generalize to deployed LLM-based applications, even though they have a very different set of constraints to the game. We release all data and source code at https://tensortrust.ai/paper
    Time-Independent Information-Theoretic Generalization Bounds for SGLD. (arXiv:2311.01046v1 [cs.LG])
    We provide novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD) under the assumptions of smoothness and dissipativity, which are widely used in sampling and non-convex optimization studies. Our bounds are time-independent and decay to zero as the sample size increases, regardless of the number of iterations and whether the step size is fixed. Unlike previous studies, we derive the generalization error bounds by focusing on the time evolution of the Kullback--Leibler divergence, which is related to the stability of datasets and is the upper bound of the mutual information between output parameters and an input dataset. Additionally, we establish the first information-theoretic generalization bound when the training and test loss are the same by showing that a loss function of SGLD is sub-exponential. This bound is also time-independent and removes the problematic step size dependence in existing work, leading to an improved excess risk bound by combining our analysis with the existing non-convex optimization error bounds.
    SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization. (arXiv:2311.00880v1 [cs.LG])
    Incorporating safety is an essential prerequisite for broadening the practical applications of reinforcement learning in real-world scenarios. To tackle this challenge, Constrained Markov Decision Processes (CMDPs) are leveraged, which introduce a distinct cost function representing safety violations. In CMDPs' settings, Lagrangian relaxation technique has been employed in previous algorithms to convert constrained optimization problems into unconstrained dual problems. However, these algorithms may inaccurately predict unsafe behavior, resulting in instability while learning the Lagrange multiplier. This study introduces a novel safe reinforcement learning algorithm, Safety Critic Policy Optimization (SCPO). In this study, we define the safety critic, a mechanism that nullifies rewards obtained through violating safety constraints. Furthermore, our theoretical analysis indicates that the proposed algorithm can automatically balance the trade-off between adhering to safety constraints and maximizing rewards. The effectiveness of the SCPO algorithm is empirically validated by benchmarking it against strong baselines.
    When Do Graph Neural Networks Help with Node Classification? Investigating the Impact of Homophily Principle on Node Distinguishability. (arXiv:2304.14274v3 [cs.SI] UPDATED)
    Homophily principle, i.e., nodes with the same labels are more likely to be connected, has been believed to be the main reason for the performance superiority of Graph Neural Networks (GNNs) over Neural Networks on node classification tasks. Recent research suggests that, even in the absence of homophily, the advantage of GNNs still exists as long as nodes from the same class share similar neighborhood patterns. However, this argument only considers intra-class Node Distinguishability (ND) but neglects inter-class ND, which provides incomplete understanding of homophily on GNNs. In this paper, we first demonstrate such deficiency with examples and argue that an ideal situation for ND is to have smaller intra-class ND than inter-class ND. To formulate this idea and study ND deeply, we propose Contextual Stochastic Block Model for Homophily (CSBM-H) and define two metrics, Probabilistic Bayes Error (PBE) and negative generalized Jeffreys divergence, to quantify ND. With the metrics, we visualize and analyze how graph filters, node degree distributions and class variances influence ND, and investigate the combined effect of intra- and inter-class ND. Besides, we discovered the mid-homophily pitfall, which occurs widely in graph datasets. Furthermore, we verified that, in real-work tasks, the superiority of GNNs is indeed closely related to both intra- and inter-class ND regardless of homophily levels. Grounded in this observation, we propose a new hypothesis-testing based performance metric beyond homophily, which is non-linear, feature-based and can provide statistical threshold value for GNNs' the superiority. Experiments indicate that it is significantly more effective than the existing homophily metrics on revealing the advantage and disadvantage of graph-aware modes on both synthetic and benchmark real-world datasets.
    Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models. (arXiv:2311.01441v1 [cs.LG])
    We propose a conceptually simple and lightweight framework for improving the robustness of vision models through the combination of knowledge distillation and data augmentation. We address the conjecture that larger models do not make for better teachers by showing strong gains in out-of-distribution robustness when distilling from pretrained foundation models. Following this finding, we propose Discrete Adversarial Distillation (DAD), which leverages a robust teacher to generate adversarial examples and a VQGAN to discretize them, creating more informative samples than standard data augmentation techniques. We provide a theoretical framework for the use of a robust teacher in the knowledge distillation with data augmentation setting and demonstrate strong gains in out-of-distribution robustness and clean accuracy across different student architectures. Notably, our method adds minor computational overhead compared to similar techniques and can be easily combined with other data augmentations for further improvements.
    Attacking Graph Neural Networks with Bit Flips: Weisfeiler and Lehman Go Indifferent. (arXiv:2311.01205v1 [cs.LG])
    Prior attacks on graph neural networks have mostly focused on graph poisoning and evasion, neglecting the network's weights and biases. Traditional weight-based fault injection attacks, such as bit flip attacks used for convolutional neural networks, do not consider the unique properties of graph neural networks. We propose the Injectivity Bit Flip Attack, the first bit flip attack designed specifically for graph neural networks. Our attack targets the learnable neighborhood aggregation functions in quantized message passing neural networks, degrading their ability to distinguish graph structures and losing the expressivity of the Weisfeiler-Lehman test. Our findings suggest that exploiting mathematical properties specific to certain graph neural network architectures can significantly increase their vulnerability to bit flip attacks. Injectivity Bit Flip Attacks can degrade the maximal expressive Graph Isomorphism Networks trained on various graph property prediction datasets to random output by flipping only a small fraction of the network's bits, demonstrating its higher destructive power compared to a bit flip attack transferred from convolutional neural networks. Our attack is transparent and motivated by theoretical insights which are confirmed by extensive empirical results.
    Multi-Operational Mathematical Derivations in Latent Space. (arXiv:2311.01230v1 [cs.LG])
    This paper investigates the possibility of approximating multiple mathematical operations in latent space for expression derivation. To this end, we introduce different multi-operational representation paradigms, modelling mathematical operations as explicit geometric transformations. By leveraging a symbolic engine, we construct a large-scale dataset comprising 1.7M derivation steps stemming from 61K premises and 6 operators, analysing the properties of each paradigm when instantiated with state-of-the-art neural encoders. Specifically, we investigate how different encoding mechanisms can approximate equational reasoning in latent space, exploring the trade-off between learning different operators and specialising within single operations, as well as the ability to support multi-step derivations and out-of-distribution generalisation. Our empirical analysis reveals that the multi-operational paradigm is crucial for disentangling different operators, while discriminating the conclusions for a single operation is achievable in the original expression encoder. Moreover, we show that architectural choices can heavily affect the training dynamics, structural organisation, and generalisation of the latent space, resulting in significant variations across paradigms and classes of encoders.
    A quantum-classical performance separation in nonconvex optimization. (arXiv:2311.00811v1 [quant-ph])
    In this paper, we identify a family of nonconvex continuous optimization instances, each $d$-dimensional instance with $2^d$ local minima, to demonstrate a quantum-classical performance separation. Specifically, we prove that the recently proposed Quantum Hamiltonian Descent (QHD) algorithm [Leng et al., arXiv:2303.01471] is able to solve any $d$-dimensional instance from this family using $\widetilde{\mathcal{O}}(d^3)$ quantum queries to the function value and $\widetilde{\mathcal{O}}(d^4)$ additional 1-qubit and 2-qubit elementary quantum gates. On the other side, a comprehensive empirical study suggests that representative state-of-the-art classical optimization algorithms/solvers (including Gurobi) would require a super-polynomial time to solve such optimization instances.
    Multimodal Foundation Models for Zero-shot Animal Species Recognition in Camera Trap Images. (arXiv:2311.01064v1 [cs.CV])
    Due to deteriorating environmental conditions and increasing human activity, conservation efforts directed towards wildlife is crucial. Motion-activated camera traps constitute an efficient tool for tracking and monitoring wildlife populations across the globe. Supervised learning techniques have been successfully deployed to analyze such imagery, however training such techniques requires annotations from experts. Reducing the reliance on costly labelled data therefore has immense potential in developing large-scale wildlife tracking solutions with markedly less human labor. In this work we propose WildMatch, a novel zero-shot species classification framework that leverages multimodal foundation models. In particular, we instruction tune vision-language models to generate detailed visual descriptions of camera trap images using similar terminology to experts. Then, we match the generated caption to an external knowledge base of descriptions in order to determine the species in a zero-shot manner. We investigate techniques to build instruction tuning datasets for detailed animal description generation and propose a novel knowledge augmentation technique to enhance caption quality. We demonstrate the performance of WildMatch on a new camera trap dataset collected in the Magdalena Medio region of Colombia.
    Zero Coordinate Shift: Whetted Automatic Differentiation for Physics-informed Operator Learning. (arXiv:2311.00860v1 [cs.LG])
    Automatic differentiation (AD) is a critical step in physics-informed machine learning, required for computing the high-order derivatives of network output w.r.t. coordinates. In this paper, we present a novel and lightweight algorithm to conduct such AD for physics-informed operator learning, as we call the trick of Zero Coordinate Shift (ZCS). Instead of making all sampled coordinates leaf variables, ZCS introduces only one scalar-valued leaf variable for each spatial or temporal dimension, leading to a game-changing performance leap by simplifying the wanted derivatives from "many-roots-many-leaves" to "one-root-many-leaves". ZCS is easy to implement with current deep learning libraries; our own implementation is by extending the DeepXDE package. We carry out a comprehensive benchmark analysis and several case studies, training physics-informed DeepONets to solve partial differential equations (PDEs) without data. The results show that ZCS has persistently brought down GPU memory consumption and wall time for training by an order of magnitude, with the savings increasing with problem scale (i.e., number of functions, number of points and order of PDE). As a low-level optimisation, ZCS entails no restrictions on data, physics (PDEs) or network architecture and does not compromise training results from any aspect.
    Harnessing machine learning for accurate treatment of overlapping opacity species in GCMs. (arXiv:2311.00775v1 [astro-ph.EP])
    To understand high precision observations of exoplanets and brown dwarfs, we need detailed and complex general circulation models (GCMs) that incorporate hydrodynamics, chemistry, and radiation. In this study, we specifically examine the coupling between chemistry and radiation in GCMs and compare different methods for mixing opacities of different chemical species in the correlated-k assumption, when equilibrium chemistry cannot be assumed. We propose a fast machine learning method based on DeepSets (DS), which effectively combines individual correlated-k opacities (k-tables). We evaluate the DS method alongside other published methods like adaptive equivalent extinction (AEE) and random overlap with rebinning and resorting (RORR). We integrate these mixing methods into our GCM (expeRT/MITgcm) and assess their accuracy and performance for the example of the hot Jupiter HD~209458 b. Our findings indicate that the DS method is both accurate and efficient for GCM usage, whereas RORR is too slow. Additionally, we observe that the accuracy of AEE depends on its specific implementation and may introduce numerical issues in achieving radiative transfer solution convergence. We then apply the DS mixing method in a simplified chemical disequilibrium situation, where we model the rainout of TiO and VO, and confirm that the rainout of TiO and VO would hinder the formation of a stratosphere. To further expedite the development of consistent disequilibrium chemistry calculations in GCMs, we provide documentation and code for coupling the DS mixing method with correlated-k radiative transfer solvers. The DS method has been extensively tested to be accurate enough for GCMs, however, other methods might be needed for accelerating atmospheric retrievals.
    Non-Autoregressive Diffusion-based Temporal Point Processes for Continuous-Time Long-Term Event Prediction. (arXiv:2311.01033v1 [cs.LG])
    Continuous-time long-term event prediction plays an important role in many application scenarios. Most existing works rely on autoregressive frameworks to predict event sequences, which suffer from error accumulation, thus compromising prediction quality. Inspired by the success of denoising diffusion probabilistic models, we propose a diffusion-based non-autoregressive temporal point process model for long-term event prediction in continuous time. Instead of generating events one at a time in an autoregressive way, our model predicts the future event sequence entirely as a whole. In order to perform diffusion processes on event sequences, we develop a bidirectional map between target event sequences and the Euclidean vector space. Furthermore, we design a novel denoising network to capture both sequential and contextual features for better sample quality. Extensive experiments are conducted to prove the superiority of our proposed model over state-of-the-art methods on long-term event prediction in continuous time. To the best of our knowledge, this is the first work to apply diffusion methods to long-term event prediction problems.
    Scalable Counterfactual Distribution Estimation in Multivariate Causal Models. (arXiv:2311.00927v1 [stat.ML])
    We consider the problem of estimating the counterfactual joint distribution of multiple quantities of interests (e.g., outcomes) in a multivariate causal model extended from the classical difference-in-difference design. Existing methods for this task either ignore the correlation structures among dimensions of the multivariate outcome by considering univariate causal models on each dimension separately and hence produce incorrect counterfactual distributions, or poorly scale even for moderate-size datasets when directly dealing with such multivariate causal model. We propose a method that alleviates both issues simultaneously by leveraging a robust latent one-dimensional subspace of the original high-dimension space and exploiting the efficient estimation from the univariate causal model on such space. Since the construction of the one-dimensional subspace uses information from all the dimensions, our method can capture the correlation structures and produce good estimates of the counterfactual distribution. We demonstrate the advantages of our approach over existing methods on both synthetic and real-world data.
  • Open

    Federated Linear Bandits with Finite Adversarial Actions. (arXiv:2311.00973v1 [cs.LG])
    We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem with finite adversarial action sets that may be different across clients. To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm, which extends the principles of SupLinUCB and OFUL algorithms in linear contextual bandits. We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{d T})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model. This matches the minimax lower bound and thus is order-optimal (up to polylog terms). We study both asynchronous and synchronous cases and show that the communication cost can be controlled as $O(d M^2 \log(d)\log(T))$ and $O(\sqrt{d^3 M^3} \log(d))$, respectively. The FedSupLinUCB design is further extended to two scenarios: (1) variance-adaptive, where a total regret of $\tilde{O} (\sqrt{d \sum \nolimits_{t=1}^{T} \sigma_t^2})$ can be achieved with $\sigma_t^2$ being the noise variance of round $t$; and (2) adversarial corruption, where a total regret of $\tilde{O}(\sqrt{dT} + d C_p)$ can be achieved with $C_p$ being the total corruption budget. Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of FedSupLinUCB on both synthetic and real-world datasets.
    Anonymous Learning via Look-Alike Clustering: A Precise Analysis of Model Generalization. (arXiv:2310.04015v3 [cs.LG] UPDATED)
    While personalized recommendations systems have become increasingly popular, ensuring user data protection remains a top concern in the development of these learning systems. A common approach to enhancing privacy involves training models using anonymous data rather than individual data. In this paper, we explore a natural technique called \emph{look-alike clustering}, which involves replacing sensitive features of individuals with the cluster's average values. We provide a precise analysis of how training models using anonymous cluster centers affects their generalization capabilities. We focus on an asymptotic regime where the size of the training set grows in proportion to the features dimension. Our analysis is based on the Convex Gaussian Minimax Theorem (CGMT) and allows us to theoretically understand the role of different model components on the generalization error. In addition, we demonstrate that in certain high-dimensional regimes, training over anonymous cluster centers acts as a regularization and improves generalization error of the trained models. Finally, we corroborate our asymptotic theory with finite-sample numerical experiments where we observe a perfect match when the sample size is only of order of a few hundreds.
    PPI++: Efficient Prediction-Powered Inference. (arXiv:2311.01453v1 [stat.ML])
    We present PPI++: a computationally lightweight methodology for estimation and inference based on a small labeled dataset and a typically much larger dataset of machine-learning predictions. The methods automatically adapt to the quality of available predictions, yielding easy-to-compute confidence sets -- for parameters of any dimensionality -- that always improve on classical intervals using only the labeled data. PPI++ builds on prediction-powered inference (PPI), which targets the same problem setting, improving its computational and statistical efficiency. Real and synthetic experiments demonstrate the benefits of the proposed adaptations.
    Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation. (arXiv:2307.02598v2 [cs.LG] UPDATED)
    We tackle the problems of latent variables identification and ``out-of-support'' image generation in representation learning. We show that both are possible for a class of decoders that we call additive, which are reminiscent of decoders used for object-centric representation learning (OCRL) and well suited for images that can be decomposed as a sum of object-specific images. We provide conditions under which exactly solving the reconstruction problem using an additive decoder is guaranteed to identify the blocks of latent variables up to permutation and block-wise invertible transformations. This guarantee relies only on very weak assumptions about the distribution of the latent factors, which might present statistical dependencies and have an almost arbitrarily shaped support. Our result provides a new setting where nonlinear independent component analysis (ICA) is possible and adds to our theoretical understanding of OCRL methods. We also show theoretically that additive decoders can generate novel images by recombining observed factors of variations in novel ways, an ability we refer to as Cartesian-product extrapolation. We show empirically that additivity is crucial for both identifiability and extrapolation on simulated data.
    The Universal Statistical Structure and Scaling Laws of Chaos and Turbulence. (arXiv:2311.01358v1 [cond-mat.stat-mech])
    Turbulence is a complex spatial and temporal structure created by the strong non-linear dynamics of fluid flows at high Reynolds numbers. Despite being an ubiquitous phenomenon that has been studied for centuries, a full understanding of turbulence remained a formidable challenge. Here, we introduce tools from the fields of quantum chaos and Random Matrix Theory (RMT) and present a detailed analysis of image datasets generated from turbulence simulations of incompressible and compressible fluid flows. Focusing on two observables: the data Gram matrix and the single image distribution, we study both the local and global eigenvalue statistics and compare them to classical chaos, uncorrelated noise and natural images. We show that from the RMT perspective, the turbulence Gram matrices lie in the same universality class as quantum chaotic rather than integrable systems, and the data exhibits power-law scalings in the bulk of its eigenvalues which are vastly different from uncorrelated classical chaos, random data, natural images. Interestingly, we find that the single sample distribution only appears as fully RMT chaotic, but deviates from chaos at larger correlation lengths, as well as exhibiting different scaling properties.
    Neural Diffusion Models. (arXiv:2310.08337v1 [cs.LG] CROSS LISTED)
    Diffusion models have shown remarkable performance on many generative tasks. Despite recent success, most diffusion models are restricted in that they only allow linear transformation of the data distribution. In contrast, broader family of transformations can potentially help train generative distributions more efficiently, simplifying the reverse process and closing the gap between the true negative log-likelihood and the variational approximation. In this paper, we present Neural Diffusion Models (NDMs), a generalization of conventional diffusion models that enables defining and learning time-dependent non-linear transformations of data. We show how to optimise NDMs using a variational bound in a simulation-free setting. Moreover, we derive a time-continuous formulation of NDMs, which allows fast and reliable inference using off-the-shelf numerical ODE and SDE solvers. Finally, we demonstrate the utility of NDMs with learnable transformations through experiments on standard image generation benchmarks, including CIFAR-10, downsampled versions of ImageNet and CelebA-HQ. NDMs outperform conventional diffusion models in terms of likelihood and produce high-quality samples.
    Score-based Data Assimilation for a Two-Layer Quasi-Geostrophic Model. (arXiv:2310.01853v2 [stat.ML] UPDATED)
    Data assimilation addresses the problem of identifying plausible state trajectories of dynamical systems given noisy or incomplete observations. In geosciences, it presents challenges due to the high-dimensionality of geophysical dynamical systems, often exceeding millions of dimensions. This work assesses the scalability of score-based data assimilation (SDA), a novel data assimilation method, in the context of such systems. We propose modifications to the score network architecture aimed at significantly reducing memory consumption and execution time. We demonstrate promising results for a two-layer quasi-geostrophic model.
    Analysis of tidal flows through the Strait of Gibraltar using Dynamic Mode Decomposition. (arXiv:2311.01377v1 [math.DS])
    The Strait of Gibraltar is a region characterized by intricate oceanic sub-mesoscale features, influenced by topography, tidal forces, instabilities, and nonlinear hydraulic processes, all governed by the nonlinear equations of fluid motion. In this study, we aim to uncover the underlying physics of these phenomena within 3D MIT general circulation model simulations, including waves, eddies, and gyres. To achieve this, we employ Dynamic Mode Decomposition (DMD) to break down simulation snapshots into Koopman modes, with distinct exponential growth/decay rates and oscillation frequencies. Our objectives encompass evaluating DMD's efficacy in capturing known features, unveiling new elements, ranking modes, and exploring order reduction. We also introduce modifications to enhance DMD's robustness, numerical accuracy, and robustness of eigenvalues. DMD analysis yields a comprehensive understanding of flow patterns, internal wave formation, and the dynamics of the Strait of Gibraltar, its meandering behaviors, and the formation of a secondary gyre, notably the Western Alboran Gyre, as well as the propagation of Kelvin and coastal-trapped waves along the African coast. In doing so, it significantly advances our comprehension of intricate oceanographic phenomena and underscores the immense utility of DMD as an analytical tool for such complex datasets, suggesting that DMD could serve as a valuable addition to the toolkit of oceanographers.
    Gaussian Processes on Cellular Complexes. (arXiv:2311.01198v1 [cs.LG])
    In recent years, there has been considerable interest in developing machine learning models on graphs in order to account for topological inductive biases. In particular, recent attention was given to Gaussian processes on such structures since they can additionally account for uncertainty. However, graphs are limited to modelling relations between two vertices. In this paper, we go beyond this dyadic setting and consider polyadic relations that include interactions between vertices, edges and one of their generalisations, known as cells. Specifically, we propose Gaussian processes on cellular complexes, a generalisation of graphs that captures interactions between these higher-order cells. One of our key contributions is the derivation of two novel kernels, one that generalises the graph Mat\'ern kernel and one that additionally mixes information of different cell types.
    Time-Independent Information-Theoretic Generalization Bounds for SGLD. (arXiv:2311.01046v1 [cs.LG])
    We provide novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD) under the assumptions of smoothness and dissipativity, which are widely used in sampling and non-convex optimization studies. Our bounds are time-independent and decay to zero as the sample size increases, regardless of the number of iterations and whether the step size is fixed. Unlike previous studies, we derive the generalization error bounds by focusing on the time evolution of the Kullback--Leibler divergence, which is related to the stability of datasets and is the upper bound of the mutual information between output parameters and an input dataset. Additionally, we establish the first information-theoretic generalization bound when the training and test loss are the same by showing that a loss function of SGLD is sub-exponential. This bound is also time-independent and removes the problematic step size dependence in existing work, leading to an improved excess risk bound by combining our analysis with the existing non-convex optimization error bounds.
    Deep Transformed Gaussian Processes. (arXiv:2310.18230v2 [cs.LG] UPDATED)
    Transformed Gaussian Processes (TGPs) are stochastic processes specified by transforming samples from the joint distribution from a prior process (typically a GP) using an invertible transformation; increasing the flexibility of the base process. Furthermore, they achieve competitive results compared with Deep Gaussian Processes (DGPs), which are another generalization constructed by a hierarchical concatenation of GPs. In this work, we propose a generalization of TGPs named Deep Transformed Gaussian Processes (DTGPs), which follows the trend of concatenating layers of stochastic processes. More precisely, we obtain a multi-layer model in which each layer is a TGP. This generalization implies an increment of flexibility with respect to both TGPs and DGPs. Exact inference in such a model is intractable. However, we show that one can use variational inference to approximate the required computations yielding a straightforward extension of the popular DSVI inference algorithm Salimbeni et al (2017). The experiments conducted evaluate the proposed novel DTGPs in multiple regression datasets, achieving good scalability and performance.
    Invariant-Feature Subspace Recovery: A New Class of Provable Domain Generalization Algorithms. (arXiv:2311.00966v1 [cs.LG])
    Domain generalization asks for models trained over a set of training environments to generalize well in unseen test environments. Recently, a series of algorithms such as Invariant Risk Minimization (IRM) have been proposed for domain generalization. However, Rosenfeld et al. (2021) shows that in a simple linear data model, even if non-convexity issues are ignored, IRM and its extensions cannot generalize to unseen environments with less than $d_s+1$ training environments, where $d_s$ is the dimension of the spurious-feature subspace. In this work, we propose Invariant-feature Subspace Recovery (ISR): a new class of algorithms to achieve provable domain generalization across the settings of classification and regression problems. First, in the binary classification setup of Rosenfeld et al. (2021), we show that our first algorithm, ISR-Mean, can identify the subspace spanned by invariant features from the first-order moments of the class-conditional distributions, and achieve provable domain generalization with $d_s+1$ training environments. Our second algorithm, ISR-Cov, further reduces the required number of training environments to $O(1)$ using the information of second-order moments. Notably, unlike IRM, our algorithms bypass non-convexity issues and enjoy global convergence guarantees. Next, we extend ISR-Mean to the more general setting of multi-class classification and propose ISR-Multiclass, which leverages class information and provably recovers the invariant-feature subspace with $\lceil d_s/k\rceil+1$ training environments for $k$-class classification. Finally, for regression problems, we propose ISR-Regression that can identify the invariant-feature subspace with $d_s+1$ training environments. Empirically, we demonstrate the superior performance of our ISRs on synthetic benchmarks. Further, ISR can be used as post-processing methods for feature extractors such as neural nets.
    Bridging Machine Learning and Sciences: Opportunities and Challenges. (arXiv:2210.13441v2 [stat.ML] UPDATED)
    The application of machine learning in sciences has seen exciting advances in recent years. As a widely applicable technique, anomaly detection has been long studied in the machine learning community. Especially, deep neural nets-based out-of-distribution detection has made great progress for high-dimensional data. Recently, these techniques have been showing their potential in scientific disciplines. We take a critical look at their applicative prospects including data universality, experimental protocols, model robustness, etc. We discuss examples that display transferable practices and domain-specific challenges simultaneously, providing a starting point for establishing a novel interdisciplinary research paradigm in the near future.
    Long Story Short: Omitted Variable Bias in Causal Machine Learning. (arXiv:2112.13398v4 [econ.EM] UPDATED)
    We derive general, yet simple, sharp bounds on the size of the omitted variable bias for a broad class of causal parameters that can be identified as linear functionals of the conditional expectation function of the outcome. Such functionals encompass many of the traditional targets of investigation in causal inference studies, such as, for example, (weighted) average of potential outcomes, average treatment effects (including subgroup effects, such as the effect on the treated), (weighted) average derivatives, and policy effects from shifts in covariate distribution -- all for general, nonparametric causal models. Our construction relies on the Riesz-Frechet representation of the target functional. Specifically, we show how the bound on the bias depends only on the additional variation that the latent variables create both in the outcome and in the Riesz representer for the parameter of interest. Moreover, in many important cases (e.g, average treatment effects and avearage derivatives) the bound is shown to depend on easily interpretable quantities that measure the explanatory power of the omitted variables. Therefore, simple plausibility judgments on the maximum explanatory power of omitted variables (in explaining treatment and outcome variation) are sufficient to place overall bounds on the size of the bias. Furthermore, we use debiased machine learning to provide flexible and efficient statistical inference on learnable components of the bounds. Finally, empirical examples demonstrate the usefulness of the approach.
    Add and Thin: Diffusion for Temporal Point Processes. (arXiv:2311.01139v1 [cs.LG])
    Autoregressive neural networks within the temporal point process (TPP) framework have become the standard for modeling continuous-time event data. Even though these models can expressively capture event sequences in a one-step-ahead fashion, they are inherently limited for long-term forecasting applications due to the accumulation of errors caused by their sequential nature. To overcome these limitations, we derive ADD-THIN, a principled probabilistic denoising diffusion model for TPPs that operates on entire event sequences. Unlike existing diffusion approaches, ADD-THIN naturally handles data with discrete and continuous components. In experiments on synthetic and real-world datasets, our model matches the state-of-the-art TPP models in density estimation and strongly outperforms them in forecasting.
    Fitted Value Iteration Methods for Bicausal Optimal Transport. (arXiv:2306.12658v2 [stat.ML] UPDATED)
    We develop a fitted value iteration (FVI) method to compute bicausal optimal transport (OT) where couplings have an adapted structure. Based on the dynamic programming formulation, FVI adopts a function class to approximate the value functions in bicausal OT. Under the concentrability condition and approximate completeness assumption, we prove the sample complexity using (local) Rademacher complexity. Furthermore, we demonstrate that multilayer neural networks with appropriate structures satisfy the crucial assumptions required in sample complexity proofs. Numerical experiments reveal that FVI outperforms linear programming and adapted Sinkhorn methods in scalability as the time horizon increases, while still maintaining acceptable accuracy.
    Dyadic Reinforcement Learning. (arXiv:2308.07843v5 [cs.LG] UPDATED)
    Mobile health aims to enhance health outcomes by delivering interventions to individuals as they go about their daily life. The involvement of care partners and social support networks often proves crucial in helping individuals managing burdensome medical conditions. This presents opportunities in mobile health to design interventions that target the dyadic relationship -- the relationship between a target person and their care partner -- with the aim of enhancing social support. In this paper, we develop dyadic RL, an online reinforcement learning algorithm designed to personalize intervention delivery based on contextual factors and past responses of a target person and their care partner. Here, multiple sets of interventions impact the dyad across multiple time intervals. The developed dyadic RL is Bayesian and hierarchical. We formally introduce the problem setup, develop dyadic RL and establish a regret bound. We demonstrate dyadic RL's empirical performance through simulation studies on both toy scenarios and on a realistic test bed constructed from data collected in a mobile health study.
    Tailoring Mixup to Data using Kernel Warping functions. (arXiv:2311.01434v1 [cs.LG])
    Data augmentation is an essential building block for learning efficient deep learning models. Among all augmentation techniques proposed so far, linear interpolation of training data points, also called mixup, has found to be effective for a large panel of applications. While the majority of works have focused on selecting the right points to mix, or applying complex non-linear interpolation, we are interested in mixing similar points more frequently and strongly than less similar ones. To this end, we propose to dynamically change the underlying distribution of interpolation coefficients through warping functions, depending on the similarity between data points to combine. We define an efficient and flexible framework to do so without losing in diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves both performance and calibration of models. Code available in https://github.com/ENSTA-U2IS/torch-uncertainty
    Gaussian Process Priors for Systems of Linear Partial Differential Equations with Constant Coefficients. (arXiv:2212.14319v4 [stat.ML] UPDATED)
    Partial differential equations (PDEs) are important tools to model physical systems and including them into machine learning models is an important way of incorporating physical knowledge. Given any system of linear PDEs with constant coefficients, we propose a family of Gaussian process (GP) priors, which we call EPGP, such that all realizations are exact solutions of this system. We apply the Ehrenpreis-Palamodov fundamental principle, which works as a non-linear Fourier transform, to construct GP kernels mirroring standard spectral methods for GPs. Our approach can infer probable solutions of linear PDE systems from any data such as noisy measurements, or pointwise defined initial and boundary conditions. Constructing EPGP-priors is algorithmic, generally applicable, and comes with a sparse version (S-EPGP) that learns the relevant spectral frequencies and works better for big data sets. We demonstrate our approach on three families of systems of PDEs, the heat equation, wave equation, and Maxwell's equations, where we improve upon the state of the art in computation time and precision, in some experiments by several orders of magnitude.
    Inversion of Bayesian Networks. (arXiv:2212.10649v2 [cs.LG] UPDATED)
    Variational autoencoders and Helmholtz machines use a recognition network (encoder) to approximate the posterior distribution of a generative model (decoder). In this paper we study the necessary and sufficient properties of a recognition network so that it can model the true posterior distribution exactly. These results are derived in the general context of probabilistic graphical modelling / Bayesian networks, for which the network represents a set of conditional independence statements. We derive both global conditions, in terms of d-separation, and local conditions for the recognition network to have the desired qualities. It turns out that for the local conditions the property perfectness (for every node, all parents are joined) plays an important role.
    A Coreset-based, Tempered Variational Posterior for Accurate and Scalable Stochastic Gaussian Process Inference. (arXiv:2311.01409v1 [cs.LG])
    We present a novel stochastic variational Gaussian process ($\mathcal{GP}$) inference method, based on a posterior over a learnable set of weighted pseudo input-output points (coresets). Instead of a free-form variational family, the proposed coreset-based, variational tempered family for $\mathcal{GP}$s (CVTGP) is defined in terms of the $\mathcal{GP}$ prior and the data-likelihood; hence, accommodating the modeling inductive biases. We derive CVTGP's lower bound for the log-marginal likelihood via marginalization of the proposed posterior over latent $\mathcal{GP}$ coreset variables, and show it is amenable to stochastic optimization. CVTGP reduces the learnable parameter size to $\mathcal{O}(M)$, enjoys numerical stability, and maintains $\mathcal{O}(M^3)$ time- and $\mathcal{O}(M^2)$ space-complexity, by leveraging a coreset-based tempered posterior that, in turn, provides sparse and explainable representations of the data. Results on simulated and real-world regression problems with Gaussian observation noise validate that CVTGP provides better evidence lower-bound estimates and predictive root mean squared error than alternative stochastic $\mathcal{GP}$ inference methods.
    On the Lipschitz constant of random neural networks. (arXiv:2311.01356v1 [stat.ML])
    Empirical studies have widely demonstrated that neural networks are highly sensitive to small, adversarial perturbations of the input. The worst-case robustness against these so-called adversarial examples can be quantified by the Lipschitz constant of the neural network. However, only few theoretical results regarding this quantity exist in the literature. In this paper, we initiate the study of the Lipschitz constant of random ReLU neural networks, i.e., neural networks whose weights are chosen at random and which employ the ReLU activation function. For shallow neural networks, we characterize the Lipschitz constant up to an absolute numerical constant. Moreover, we extend our analysis to deep neural networks of sufficiently large width where we prove upper and lower bounds for the Lipschitz constant. These bounds match up to a logarithmic factor that depends on the depth.
    Unreading Race: Purging Protected Features from Chest X-ray Embeddings. (arXiv:2311.01349v1 [cs.LG])
    Purpose: To analyze and remove protected feature effects in chest radiograph embeddings of deep learning models. Materials and Methods: An orthogonalization is utilized to remove the influence of protected features (e.g., age, sex, race) in chest radiograph embeddings, ensuring feature-independent results. To validate the efficacy of the approach, we retrospectively study the MIMIC and CheXpert datasets using three pre-trained models, namely a supervised contrastive, a self-supervised contrastive, and a baseline classifier model. Our statistical analysis involves comparing the original versus the orthogonalized embeddings by estimating protected feature influences and evaluating the ability to predict race, age, or sex using the two types of embeddings. Results: Our experiments reveal a significant influence of protected features on predictions of pathologies. Applying orthogonalization removes these feature effects. Apart from removing any influence on pathology classification, while maintaining competitive predictive performance, orthogonalized embeddings further make it infeasible to directly predict protected attributes and mitigate subgroup disparities. Conclusion: The presented work demonstrates the successful application and evaluation of the orthogonalization technique in the domain of chest X-ray classification.
    A Finite-Particle Convergence Rate for Stein Variational Gradient Descent. (arXiv:2211.09721v5 [cs.LG] UPDATED)
    We provide the first finite-particle convergence rate for Stein variational gradient descent (SVGD), a popular algorithm for approximating a probability distribution with a collection of particles. Specifically, whenever the target distribution is sub-Gaussian with a Lipschitz score, SVGD with n particles and an appropriate step size sequence drives the kernel Stein discrepancy to zero at an order 1/sqrt(log log n) rate. We suspect that the dependence on n can be improved, and we hope that our explicit, non-asymptotic proof strategy will serve as a template for future refinements.
    Kernel-based Joint Independence Tests for Multivariate Stationary and Non-stationary Time Series. (arXiv:2305.08529v3 [stat.ME] UPDATED)
    Multivariate time series data that capture the temporal evolution of interconnected systems are ubiquitous in diverse areas. Understanding the complex relationships and potential dependencies among co-observed variables is crucial for the accurate statistical modelling and analysis of such systems. Here, we introduce kernel-based statistical tests of joint independence in multivariate time series by extending the $d$-variable Hilbert-Schmidt independence criterion (dHSIC) to encompass both stationary and non-stationary processes, thus allowing broader real-world applications. By leveraging resampling techniques tailored for both single- and multiple-realisation time series, we show how the method robustly uncovers significant higher-order dependencies in synthetic examples, including frequency mixing data and logic gates, as well as real-world climate, neuroscience, and socioeconomic data. Our method adds to the mathematical toolbox for the analysis of multivariate time series and can aid in uncovering high-order interactions in data.
    On Learning Gaussian Multi-index Models with Gradient Flow. (arXiv:2310.19793v2 [stat.ML] UPDATED)
    We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear projection and an arbitrary unknown, low-dimensional link function. As such, they constitute a natural template for feature learning in neural networks. We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection. By appropriately exploiting the matrix semigroup structure arising over the subspace correlation matrices, we establish global convergence of the resulting Grassmannian population gradient flow dynamics, and provide a quantitative description of its associated `saddle-to-saddle' dynamics. Notably, the timescales associated with each saddle can be explicitly characterized in terms of an appropriate Hermite decomposition of the target link function. In contrast with these positive results, we also show that the related \emph{planted} problem, where the link function is known and fixed, in fact has a rough optimization landscape, in which gradient flow dynamics might get trapped with high probability.
    Towards Characterizing the First-order Query Complexity of Learning (Approximate) Nash Equilibria in Zero-sum Matrix Games. (arXiv:2304.12768v2 [cs.GT] UPDATED)
    In the first-order query model for zero-sum $K\times K$ matrix games, players observe the expected pay-offs for all their possible actions under the randomized action played by their opponent. This classical model has received renewed interest after the discovery by Rakhlin and Sridharan that $\epsilon$-approximate Nash equilibria can be computed efficiently from $O(\frac{\ln K}{\epsilon})$ instead of $O(\frac{\ln K}{\epsilon^2})$ queries. Surprisingly, the optimal number of such queries, as a function of both $\epsilon$ and $K$, is not known. We make progress on this question on two fronts. First, we fully characterise the query complexity of learning exact equilibria ($\epsilon=0$), by showing that they require a number of queries that is linear in $K$, which means that it is essentially as hard as querying the whole matrix, which can also be done with $K$ queries. Second, for $\epsilon > 0$, the current query complexity upper bound stands at $O(\min(\frac{\ln(K)}{\epsilon} , K))$. We argue that, unfortunately, obtaining a matching lower bound is not possible with existing techniques: we prove that no lower bound can be derived by constructing hard matrices whose entries take values in a known countable set, because such matrices can be fully identified by a single query. This rules out, for instance, reducing to an optimization problem over the hypercube by encoding it as a binary payoff matrix. We then introduce a new technique for lower bounds, which allows us to obtain lower bounds of order $\tilde\Omega(\log(\frac{1}{K\epsilon})$ for any $\epsilon \leq 1 / (cK^4)$, where $c$ is a constant independent of $K$. We further discuss possible future directions to improve on our techniques in order to close the gap with the upper bounds.
    Generalized Bayesian Inference for Scientific Simulators via Amortized Cost Estimation. (arXiv:2305.15208v2 [stat.ML] UPDATED)
    Simulation-based inference (SBI) enables amortized Bayesian inference for simulators with implicit likelihoods. But when we are primarily interested in the quality of predictive simulations, or when the model cannot exactly reproduce the observed data (i.e., is misspecified), targeting the Bayesian posterior may be overly restrictive. Generalized Bayesian Inference (GBI) aims to robustify inference for (misspecified) simulator models, replacing the likelihood-function with a cost function that evaluates the goodness of parameters relative to data. However, GBI methods generally require running multiple simulations to estimate the cost function at each parameter value during inference, making the approach computationally infeasible for even moderately complex simulators. Here, we propose amortized cost estimation (ACE) for GBI to address this challenge: We train a neural network to approximate the cost function, which we define as the expected distance between simulations produced by a parameter and observed data. The trained network can then be used with MCMC to infer GBI posteriors for any observation without running additional simulations. We show that, on several benchmark tasks, ACE accurately predicts cost and provides predictive simulations that are closer to synthetic observations than other SBI methods, especially for misspecified simulators. Finally, we apply ACE to infer parameters of the Hodgkin-Huxley model given real intracellular recordings from the Allen Cell Types Database. ACE identifies better data-matching parameters while being an order of magnitude more simulation-efficient than a standard SBI method. In summary, ACE combines the strengths of SBI methods and GBI to perform robust and simulation-amortized inference for scientific simulators.
    Exclusive Group Lasso for Structured Variable Selection. (arXiv:2108.10284v2 [cs.LG] UPDATED)
    A structured variable selection problem is considered in which the covariates, divided into predefined groups, activate according to sparse patterns with few nonzero entries per group. Capitalizing on the concept of atomic norm, a composite norm can be properly designed to promote such exclusive group sparsity patterns. The resulting norm lends itself to efficient and flexible regularized optimization algorithms for support recovery, like the proximal algorithm. Moreover, an active set algorithm is proposed that builds the solution by successively including structure atoms into the estimated support. It is also shown that such an algorithm can be tailored to match more rigid structures than plain exclusive group sparsity. Asymptotic consistency analysis (with both the number of parameters as well as the number of groups growing with the observation size) establishes the effectiveness of the proposed solution in terms of signed support recovery under conventional assumptions. Finally, a set of numerical simulations further corroborates the results.
    Targeted Separation and Convergence with Kernel Discrepancies. (arXiv:2209.12835v2 [stat.ML] UPDATED)
    Maximum mean discrepancies (MMDs) like the kernel Stein discrepancy (KSD) have grown central to a wide range of applications, including hypothesis testing, sampler selection, distribution approximation, and variational inference. In each setting, these kernel-based discrepancy measures are required to (i) separate a target P from other probability measures or even (ii) control weak convergence to P. In this article we derive new sufficient and necessary conditions to ensure (i) and (ii). For MMDs on separable metric spaces, we characterize those kernels that separate Bochner embeddable measures and introduce simple conditions for separating all measures with unbounded kernels and for controlling convergence with bounded kernels. We use these results on $\mathbb{R}^d$ to substantially broaden the known conditions for KSD separation and convergence control and to develop the first KSDs known to exactly metrize weak convergence to P. Along the way, we highlight the implications of our results for hypothesis testing, measuring and improving sample quality, and sampling with Stein variational gradient descent.  ( 2 min )
    The Behavior and Convergence of Local Bayesian Optimization. (arXiv:2305.15572v2 [cs.LG] UPDATED)
    A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by M\"uller et al. (2021), and derive convergence rates in both the noisy and noiseless settings.  ( 2 min )
    Computable Phenotypes of Patient Acuity in the Intensive Care Unit. (arXiv:2005.05163v2 [q-bio.QM] UPDATED)
    Continuous monitoring and patient acuity assessments are key aspects of Intensive Care Unit (ICU) practice, but both are limited by time constraints imposed on healthcare providers. Moreover, anticipating clinical trajectories remains imprecise. The objectives of this study are to (1) develop an electronic phenotype of acuity using automated variable retrieval within the electronic health records and (2) describe transitions between acuity states that illustrate the clinical trajectories of ICU patients. We gathered two single-center, longitudinal electronic health record datasets for 51,372 adult ICU patients admitted to the University of Florida Health (UFH) Gainesville (GNV) and Jacksonville (JAX). We developed algorithms to quantify acuity status at four-hour intervals for each ICU admission and identify acuity phenotypes using continuous acuity status and k-means clustering approach. 51,073 admissions for 38,749 patients in the UFH GNV dataset and 22,219 admissions for 12,623 patients in the UFH JAX dataset had at least one ICU stay lasting more than four hours. There were three phenotypes: persistently stable, persistently unstable, and transitioning from unstable to stable. For stable patients, approximately 0.7%-1.7% would transition to unstable, 0.02%-0.1% would expire, 1.2%-3.4% would be discharged, and the remaining 96%-97% would remain stable in the ICU every four hours. For unstable patients, approximately 6%-10% would transition to stable, 0.4%-0.5% would expire, and the remaining 89%-93% would remain unstable in the ICU in the next four hours. We developed phenotyping algorithms for patient acuity status every four hours while admitted to the ICU. This approach may be useful in developing prognostic and clinical decision-support tools to aid patients, caregivers, and providers in shared decision-making processes regarding escalation of care and patient values.  ( 3 min )
    Amortized Simulation-Based Frequentist Inference for Tractable and Intractable Likelihoods. (arXiv:2306.07769v2 [stat.ME] UPDATED)
    High-fidelity simulators that connect theoretical models with observations are indispensable tools in many sciences. When coupled with machine learning, a simulator makes it possible to infer the parameters of a theoretical model directly from real and simulated observations without explicit use of the likelihood function. This is of particular interest when the latter is intractable. In this work, we introduce a simple extension of the recently proposed likelihood-free frequentist inference (LF2I) approach that has some computational advantages. Like LF2I, this extension yields provably valid confidence sets in parameter inference problems in which a high-fidelity simulator is available. The utility of our algorithm is illustrated by applying it to three pedagogically interesting examples: the first is from cosmology, the second from high-energy physics and astronomy, both with tractable likelihoods, while the third, with an intractable likelihood, is from epidemiology.  ( 2 min )
    Discrepancy Modeling Framework: Learning missing physics, modeling systematic residuals, and disambiguating between deterministic and random effects. (arXiv:2203.05164v2 [stat.ML] UPDATED)
    Physics-based and first-principles models pervade the engineering and physical sciences, allowing for the ability to model the dynamics of complex systems with a prescribed accuracy. The approximations used in deriving governing equations often result in discrepancies between the model and sensor-based measurements of the system, revealing the approximate nature of the equations and/or the signal-to-noise ratio of the sensor itself. In modern dynamical systems, such discrepancies between model and measurement can lead to poor quantification, often undermining the ability to produce accurate and precise control algorithms. We introduce a discrepancy modeling framework to identify the missing physics and resolve the model-measurement mismatch with two distinct approaches: (i) by learning a model for the evolution of systematic state-space residual, and (ii) by discovering a model for the deterministic dynamical error. Regardless of approach, a common suite of data-driven model discovery methods can be used. The choice of method depends on one's intent (e.g., mechanistic interpretability) for discrepancy modeling, sensor measurement characteristics (e.g., quantity, quality, resolution), and constraints imposed by practical applications (e.g., modeling approaches using the suite of data-driven modeling methods on three continuous dynamical systems under varying signal-to-noise ratios. Finally, we emphasize structural shortcomings of each discrepancy modeling approach depending on error type. In summary, if the true dynamics are unknown (i.e., an imperfect model), one should learn a discrepancy model of the missing physics in the dynamical space. Yet, if the true dynamics are known yet model-measurement mismatch still exists, one should learn a discrepancy model in the state space.  ( 3 min )
    Sequence Modeling with Multiresolution Convolutional Memory. (arXiv:2305.01638v2 [cs.LG] UPDATED)
    Efficiently capturing the long-range patterns in sequential data sources salient to a given task -- such as classification and generative modeling -- poses a fundamental challenge. Popular approaches in the space tradeoff between the memory burden of brute-force enumeration and comparison, as in transformers, the computational burden of complicated sequential dependencies, as in recurrent neural networks, or the parameter burden of convolutional networks with many or large filters. We instead take inspiration from wavelet-based multiresolution analysis to define a new building block for sequence modeling, which we call a MultiresLayer. The key component of our model is the multiresolution convolution, capturing multiscale trends in the input sequence. Our MultiresConv can be implemented with shared filters across a dilated causal convolution tree. Thus it garners the computational advantages of convolutional networks and the principled theoretical motivation of wavelet decompositions. Our MultiresLayer is straightforward to implement, requires significantly fewer parameters, and maintains at most a $\mathcal{O}(N\log N)$ memory footprint for a length $N$ sequence. Yet, by stacking such layers, our model yields state-of-the-art performance on a number of sequence classification and autoregressive density estimation tasks using CIFAR-10, ListOps, and PTB-XL datasets.  ( 2 min )
    Conformal Prediction for Time Series with Modern Hopfield Networks. (arXiv:2303.12783v2 [cs.LG] UPDATED)
    To quantify uncertainty, conformal prediction methods are gaining continuously more interest and have already been successfully applied to various domains. However, they are difficult to apply to time series as the autocorrelative structure of time series violates basic assumptions required by conformal prediction. We propose HopCPT, a novel conformal prediction approach for time series that not only copes with temporal structures but leverages them. We show that our approach is theoretically well justified for time series where temporal dependencies are present. In experiments, we demonstrate that our new approach outperforms state-of-the-art conformal prediction methods on multiple real-world time series datasets from four different domains.  ( 2 min )
    Manifold-augmented Eikonal Equations: Geodesic Distances and Flows on Differentiable Manifolds. (arXiv:2310.06157v2 [cs.CG] UPDATED)
    Manifolds discovered by machine learning models provide a compact representation of the underlying data. Geodesics on these manifolds define locally length-minimising curves and provide a notion of distance, which are key for reduced-order modelling, statistical inference, and interpolation. In this work, we propose a model-based parameterisation for distance fields and geodesic flows on manifolds, exploiting solutions of a manifold-augmented Eikonal equation. We demonstrate how the geometry of the manifold impacts the distance field, and exploit the geodesic flow to obtain globally length-minimising curves directly. This work opens opportunities for statistics and reduced-order modelling on differentiable manifolds.  ( 2 min )
    Sample-efficient Multi-objective Molecular Optimization with GFlowNets. (arXiv:2302.04040v2 [cs.LG] UPDATED)
    Many crucial scientific problems involve designing novel molecules with desired properties, which can be formulated as a black-box optimization problem over the discrete chemical space. In practice, multiple conflicting objectives and costly evaluations (e.g., wet-lab experiments) make the diversity of candidates paramount. Computational methods have achieved initial success but still struggle with considering diversity in both objective and search space. To fill this gap, we propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN) as an acquisition function optimizer, with the purpose of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front. Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives. We further propose a hindsight-like off-policy strategy to share high-performing molecules among different preferences in order to speed up learning for HN-GFN. We empirically illustrate that HN-GFN has adequate capacity to generalize over preferences. Moreover, experiments in various real-world MOBO settings demonstrate that our framework predominantly outperforms existing methods in terms of candidate quality and sample efficiency. The code is available at https://github.com/violet-sto/HN-GFN.  ( 2 min )
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v6 [eess.IV] UPDATED)
    This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional "content" latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining "texture" variables characterizing the diffusion process are synthesized at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving multiple datasets and image quality assessment metrics show that our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics. Furthermore, training the diffusion with X-parameterization enables high-quality reconstructions in only a handful of decoding steps, greatly affecting the model's practicality.  ( 2 min )
    Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models. (arXiv:2311.00871v1 [cs.LG])
    Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-context learning (ICL) -- to perform new tasks when prompted with unseen input-output examples without any explicit model training. In this work, we study how effectively transformers can bridge between their pretraining data mixture, comprised of multiple distinct task families, to identify and learn new tasks in-context which are both inside and outside the pretraining distribution. Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of $(x, f(x))$ pairs rather than natural language. Our empirical results show transformers demonstrate near-optimal unsupervised model selection capabilities, in their ability to first in-context identify different task families and in-context learn within them when the task families are well-represented in their pretraining data. However when presented with tasks or functions which are out-of-domain of their pretraining data, we demonstrate various failure modes of transformers and degradation of their generalization for even simple extrapolation tasks. Together our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than inductive biases that create fundamental generalization capabilities.  ( 2 min )
    Contrastive Moments: Unsupervised Halfspace Learning in Polynomial Time. (arXiv:2311.01435v1 [cs.LG])
    We give a polynomial-time algorithm for learning high-dimensional halfspaces with margins in $d$-dimensional space to within desired TV distance when the ambient distribution is an unknown affine transformation of the $d$-fold product of an (unknown) symmetric one-dimensional logconcave distribution, and the halfspace is introduced by deleting at least an $\epsilon$ fraction of the data in one of the component distributions. Notably, our algorithm does not need labels and establishes the unique (and efficient) identifiability of the hidden halfspace under this distributional assumption. The sample and time complexity of the algorithm are polynomial in the dimension and $1/\epsilon$. The algorithm uses only the first two moments of suitable re-weightings of the empirical distribution, which we call contrastive moments; its analysis uses classical facts about generalized Dirichlet polynomials and relies crucially on a new monotonicity property of the moment ratio of truncations of logconcave distributions. Such algorithms, based only on first and second moments were suggested in earlier work, but hitherto eluded rigorous guarantees. Prior work addressed the special case when the underlying distribution is Gaussian via Non-Gaussian Component Analysis. We improve on this by providing polytime guarantees based on Total Variation (TV) distance, in place of existing moment-bound guarantees that can be super-polynomial. Our work is also the first to go beyond Gaussians in this setting.  ( 2 min )
    Generalizing Nonlinear ICA Beyond Structural Sparsity. (arXiv:2311.00866v1 [cs.LG])
    Nonlinear independent component analysis (ICA) aims to uncover the true latent sources from their observable nonlinear mixtures. Despite its significance, the identifiability of nonlinear ICA is known to be impossible without additional assumptions. Recent advances have proposed conditions on the connective structure from sources to observed variables, known as Structural Sparsity, to achieve identifiability in an unsupervised manner. However, the sparsity constraint may not hold universally for all sources in practice. Furthermore, the assumptions of bijectivity of the mixing process and independence among all sources, which arise from the setting of ICA, may also be violated in many real-world scenarios. To address these limitations and generalize nonlinear ICA, we propose a set of new identifiability results in the general settings of undercompleteness, partial sparsity and source dependence, and flexible grouping structures. Specifically, we prove identifiability when there are more observed variables than sources (undercomplete), and when certain sparsity and/or source independence assumptions are not met for some changing sources. Moreover, we show that even in cases with flexible grouping structures (e.g., part of the sources can be divided into irreducible independent groups with various sizes), appropriate identifiability results can also be established. Theoretical claims are supported empirically on both synthetic and real-world datasets.  ( 2 min )
    Stochastic Smoothed Gradient Descent Ascent for Federated Minimax Optimization. (arXiv:2311.00944v1 [stat.ML])
    In recent years, federated minimax optimization has attracted growing interest due to its extensive applications in various machine learning tasks. While Smoothed Alternative Gradient Descent Ascent (Smoothed-AGDA) has proved its success in centralized nonconvex minimax optimization, how and whether smoothing technique could be helpful in federated setting remains unexplored. In this paper, we propose a new algorithm termed Federated Stochastic Smoothed Gradient Descent Ascent (FESS-GDA), which utilizes the smoothing technique for federated minimax optimization. We prove that FESS-GDA can be uniformly used to solve several classes of federated minimax problems and prove new or better analytical convergence results for these settings. We showcase the practical efficiency of FESS-GDA in practical federated learning tasks of training generative adversarial networks (GANs) and fair classification.  ( 2 min )
    Scalable Counterfactual Distribution Estimation in Multivariate Causal Models. (arXiv:2311.00927v1 [stat.ML])
    We consider the problem of estimating the counterfactual joint distribution of multiple quantities of interests (e.g., outcomes) in a multivariate causal model extended from the classical difference-in-difference design. Existing methods for this task either ignore the correlation structures among dimensions of the multivariate outcome by considering univariate causal models on each dimension separately and hence produce incorrect counterfactual distributions, or poorly scale even for moderate-size datasets when directly dealing with such multivariate causal model. We propose a method that alleviates both issues simultaneously by leveraging a robust latent one-dimensional subspace of the original high-dimension space and exploiting the efficient estimation from the univariate causal model on such space. Since the construction of the one-dimensional subspace uses information from all the dimensions, our method can capture the correlation structures and produce good estimates of the counterfactual distribution. We demonstrate the advantages of our approach over existing methods on both synthetic and real-world data.  ( 2 min )
    Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis. (arXiv:2311.01052v1 [stat.ML])
    We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.  ( 2 min )
    High-dimensional Linear Bandits with Knapsacks. (arXiv:2311.01327v1 [cs.LG])
    We study the contextual bandits with knapsack (CBwK) problem under the high-dimensional setting where the dimension of the feature is large. The reward of pulling each arm equals the multiplication of a sparse high-dimensional weight vector and the feature of the current arrival, with additional random noise. In this paper, we investigate how to exploit this sparsity structure to achieve improved regret for the CBwK problem. To this end, we first develop an online variant of the hard thresholding algorithm that performs the sparse estimation in an online manner. We further combine our online estimator with a primal-dual framework, where we assign a dual variable to each knapsack constraint and utilize an online learning algorithm to update the dual variable, thereby controlling the consumption of the knapsack capacity. We show that this integrated approach allows us to achieve a sublinear regret that depends logarithmically on the feature dimension, thus improving the polynomial dependency established in the previous literature. We also apply our framework to the high-dimension contextual bandit problem without the knapsack constraint and achieve optimal regret in both the data-poor regime and the data-rich regime. We finally conduct numerical experiments to show the efficient empirical performance of our algorithms under the high dimensional setting.  ( 2 min )
    Time-series Generation by Contrastive Imitation. (arXiv:2311.01388v1 [stat.ML])
    Consider learning a generative model for time-series data. The sequential setting poses a unique challenge: Not only should the generator capture the conditional dynamics of (stepwise) transitions, but its open-loop rollouts should also preserve the joint distribution of (multi-step) trajectories. On one hand, autoregressive models trained by MLE allow learning and computing explicit transition distributions, but suffer from compounding error during rollouts. On the other hand, adversarial models based on GAN training alleviate such exposure bias, but transitions are implicit and hard to assess. In this work, we study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy, where the reinforcement signal is provided by a global (but stepwise-decomposable) energy model trained by contrastive estimation. At training, the two components are learned cooperatively, avoiding the instabilities typical of adversarial objectives. At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality. By expressly training a policy to imitate sequential behavior of time-series features in a dataset, this approach embodies "generation by imitation". Theoretically, we illustrate the correctness of this formulation and the consistency of the algorithm. Empirically, we evaluate its ability to generate predictively useful samples from real-world datasets, verifying that it performs at the standard of existing benchmarks.  ( 2 min )
    Bounding Wasserstein distance with couplings. (arXiv:2112.03152v3 [stat.CO] UPDATED)
    Markov chain Monte Carlo (MCMC) provides asymptotically consistent estimates of intractable posterior expectations as the number of iterations tends to infinity. However, in large data applications, MCMC can be computationally expensive per iteration. This has catalyzed interest in approximating MCMC in a manner that improves computational speed per iteration but does not produce asymptotically consistent estimates. In this article, we propose estimators based on couplings of Markov chains to assess the quality of such asymptotically biased sampling methods. The estimators give empirical upper bounds of the Wasserstein distance between the limiting distribution of the asymptotically biased sampling method and the original target distribution of interest. We establish theoretical guarantees for our upper bounds and show that our estimators can remain effective in high dimensions. We apply our quality measures to stochastic gradient MCMC, variational Bayes, and Laplace approximations for tall data and to approximate MCMC for Bayesian logistic regression in 4500 dimensions and Bayesian linear regression in 50000 dimensions.  ( 2 min )
    Data-Driven Model Selections of Second-Order Particle Dynamics via Integrating Gaussian Processes with Low-Dimensional Interacting Structures. (arXiv:2311.00902v1 [stat.ML])
    In this paper, we focus on the data-driven discovery of a general second-order particle-based model that contains many state-of-the-art models for modeling the aggregation and collective behavior of interacting agents of similar size and body type. This model takes the form of a high-dimensional system of ordinary differential equations parameterized by two interaction kernels that appraise the alignment of positions and velocities. We propose a Gaussian Process-based approach to this problem, where the unknown model parameters are marginalized by using two independent Gaussian Process (GP) priors on latent interaction kernels constrained to dynamics and observational data. This results in a nonparametric model for interacting dynamical systems that accounts for uncertainty quantification. We also develop acceleration techniques to improve scalability. Moreover, we perform a theoretical analysis to interpret the methodology and investigate the conditions under which the kernels can be recovered. We demonstrate the effectiveness of the proposed approach on various prototype systems, including the selection of the order of the systems and the types of interactions. In particular, we present applications to modeling two real-world fish motion datasets that display flocking and milling patterns up to 248 dimensions. Despite the use of small data sets, the GP-based approach learns an effective representation of the nonlinear dynamics in these spaces and outperforms competitor methods.  ( 3 min )

  • Open

    [Research] Which python libraries do you recommend for label ranking?
    I'm currently looking for python libraries that offer models for the label ranking problem. So provided with a context x and a set of Labels Y, the model should output a ranking of those labels. I'm mostly interested in models that implements the Ranking by comparison method (RPC) and the Plackett-Luce Model. I would be grateful for any hints. submitted by /u/Emergency_Caramel_69 [link] [comments]  ( 9 min )
    [D] Tuning Models for Closed Book Q&A
    After transitioning from a different career path into ML/AI, I've embarked on a self-learning journey that's been incredibly rewarding yet challenging at times. I'd love to get your insights on a project I'm working on. Here’s what I’ve done so far: Dived into AI research papers and theoretical foundations. Set up a personal server with high-end GPUs. Conducted experiments with models like Llama2. Implemented low-rank adapters in models for document generation. Trained transformers using pure PyTorch (steering clear of Hugging Face from now on). My current challenge is training a model that can perform Q&A on documents it has generated. While the Flan paper has provided some direction, practical application has proven to be complex, especially when balancing resource use and noise constraints(servers are very loud and I'd like to minimize training time for this reason). I'm reaching out for: Insights on training Llama (or similar models) for Q&A tasks on generated content. Experiences or lessons learned from implementing advice from the Flan paper or similar research. Tips on efficient resource management during training (to alleviate time, noise, and power concerns). Your collective wisdom would be a beacon for a lone learner like myself. If there are foundational concepts I might be missing or resources I should consult, please point me in the right direction. Thank you for your time and help! submitted by /u/TheRealBracketMaster [link] [comments]  ( 9 min )
    [D] [P] First time delving into Gen AI
    Hi, I need to make a powerpoint presentation for a startup founder who might hire me. He owns a digital/social media marketing company that also does website development. So I need to show the following things in the presentation:- Create an automated LDM model/service with very high accuracy (>90%) for UI/UX design. Goal is to minimize the role of graphic designers as much as possible. Create an automated LLM model/service with very high accuracy (>95%) for onboarding/handling customers as well as customer support. Currently, the task is managed by a bunch of product managers and ChatGPT powered chatbots. Goal is to minimize the role of these product managers as much as possible. Besides the above, if you guys have any other AI solutions that I could include then I'd be very grateful. The more valuable solutions I can pitch to the founder, the better chances I'll have to getting a job in his company. Please help me because I've never really played with Gen AI before. Please point me in the right direction 🙏 Thanks in advance! submitted by /u/master-killerrr [link] [comments]
    [D] Label my own data for Fine-tuning OCR?
    Hi! I was wondering if anyone could give me a heads up on what I can do to label my own corpus of data to fine tune an OCR model. An example of what I want to detect better is shown here https://files.catbox.moe/bnsb2i.png Right now stuff like amazon textextract can't pickup the superscript footnote markers (the small 1,2,3) very accurately, and I'd like to manually mark some datasets to hopefully increase this. This is actually a very common book format, and yeah I do realize it will take a lot of time but it's better than nothing I suppose. Thanks and God bless! submitted by /u/angel__-__- [link] [comments]  ( 9 min )
    [D]Does view function in PyTorch cause loss of information? I want to understand more about tensor dimension compression
    So let's say I have an encoded tensor of size [1024,4,20] which represents [number_of_people, 4 choices, embedding_dim] Which mean each individual person will have 4 choices on, let's say food, and this differ accross the entire population (which is now a batch). This tensor is already encoded with torch.nn.Embedding, where 20 is the dim size. I need to compress this into a tensor of size [4,20] so that all those "meal choices" information of each unique person is now a 2d tensor, since I need to multiply this matrix with another embeddings. I'm still in my learning phase with pytorch and trying to go through documents on dimension reduction (I'm familiar with all the reshaping and etc, but this is the first time I actually have to really think and be careful since it could mean I might lose information if I don't do this correctly). Could someone explain to me why using certain things, like view, would not lose these specific information about each individual person and ensuring that I can carry on to use this compressed "choices embedding" on other thing with a peace of mind that I'm doing it correctly. I guess this comes with my lack of understanding in linear algebra, any guidance on how to understand this topic deeply would be greatly appreciated. I was trying to get a resulting tensor of size [1024,4] but my scoring function would give the tensor of size [1024,1024,4] if I multiple [1024,20] with [1024,20,4] (permuted choices tensor). At this point I'm also not sure if I should reduce it earlier on with original encoded tensor or as the last step on the resulting tensor. submitted by /u/parz01000101 [link] [comments]  ( 9 min )
    [D] What do you fine folks think of this - Causal AI ?
    Stanford blog post - https://ssir.org/articles/entry/the_case_for_causal_ai Found this other link that seems more ready - https://causalens.com/ What do you think ? Next wonder or smokescreen ? Update - Found causalens website - https://preview.redd.it/x8azqhw136yb1.png?width=527&format=png&auto=webp&s=2d0cb3e0139233b2e9f9598e10d6060a6a4105cb Makes me wonder why this cannot be implemented with a ML / DL model ? Its human designed input to be structured as a constraint.. what do you think ? submitted by /u/dpadhy [link] [comments]  ( 9 min )
    [D] Comparing RL and LLM Prompting for Modern Game Playing AI Systems (a video)
    Hey people, I wanted to share a video from my ML YouTube channel discussing the state of the art methods for game playing AI systems post the LLM boom. Of course, this space was dominated by Reinforcement Learning for most of the 2010s, but there has been some interesting work towards using LLMs solo or as an “RL assistant” to train better RL agents. Here’s my video breaking down the complex prompting systems that let LLMs like GPT4 play Minecraft-like open world games and reflect on their progress. Hope people who are interested find it worthwhile… https://youtu.be/cXfnNoMgCio submitted by /u/AvvYaa [link] [comments]  ( 9 min )
    [P] Vector quantization methods
    ​ https://preview.redd.it/mq8y8as0q4yb1.jpg?width=1374&format=pjpg&auto=webp&s=41a28968c929bf5a44fab03bef67991319e5728b txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows. txtai has built-in support for quantizing vectors. The code above shows how 1-bit (binary) quantization can be applied. With 1-bit quantization, each dimension is transformed into a 1-bit value (0 or 1). Those bits are grouped into uint8's. This method can retain a surprising amount of accuracy, especially with high dimension models. See the article below for more information along with benchmarks. Article: https://neuml.hashnode.dev/all-about-vector-quantization GitHub: https://github.com/neuml/txtai submitted by /u/davidmezzetti [link] [comments]  ( 9 min )
    [D] Overfitting
    Assuming that my training and testing set have the same distribution. If my training performance is very good but clearing overfitting, can I say that there MUST be a way to tune the model so that the testing performance will also improve? Or what other assumptions on my data do I have to make? submitted by /u/jef_107 [link] [comments]  ( 9 min )
    [D] Seeking Clarification on LoRA, Adapters, and Prefix Tuning in LLMs
    I've come across several "parameters efficient" finetuning methods, LoRA (Low-Rank Adaptation), adapters, and prefix tuning, and I'm trying to understand the differences between them in the context of LLMs. I'm curious about the specific advantages and disadvantages of each method. For instance, how does the efficiency and performance of LoRA (that modify a selected subset of parameters) compare to adapters and prefix tuning (that add a small amount of parameters)? Are there any significant trade-offs to consider when choosing between these methods? I've come across a lots of LLMs trained with LoRA, but I'm struggling to find models trained with adapters or prefix tuning. Any guidance on this would be greatly appreciated. Also, Is it possible to use LoRA, adapters, and prefix tuning simultaneously in a single LLM? If so, are there any known benefits or drawbacks to this approach? Thank you in advance for your insights. submitted by /u/Distinct-Target7503 [link] [comments]  ( 9 min )
    [D] Making an autoencoder rotation invariant for image clustering?
    I'm trying to cluster PDF files that I've converted into images, and I've gotten a good suggestion to train an autoencoder with convolutional layers and cluster in the latent space. I'm hoping to implement this with Keras. The problem I'm running into is that these PDF files are scans, so some of the files are slightly rotated, and some of them are rotated by a full 90 degrees. As far as I know autoencoders are generally not rotation invariant, and all I was able to find online is a solution to a weird problem that involves 2d images of objects rotated in 3d. Is there a way to make an autoencoder that does have simple rotation invariance? submitted by /u/TeenColonistWrangler [link] [comments]  ( 9 min )
    [D] Data extraction with LLMs: JSON, CSV or …?
    I’ve been reading about a bunch of different methods for extracting structured data from text with LLMs (from docs, audio transcripts, etc). One approach is entity extraction into a connected knowledge graph (ie people, places, things). Another is providing a JSON schema to extract into, and outputting JSON. I’ve also seen table extraction and outputting CSV. 🙋‍♂️ If you’ve been using (or want to use) LLM data extraction in your workflows, which method have you been using (or are looking to use in future)? I’d be interested to learn what methods are needed for real apps, vs what’s just been used for one-off demos. Appreciate any insight! submitted by /u/DeadPukka [link] [comments]
    [D] About Scientific Machine Learning
    [D] Hi everyone! I am new to SciML. When I read papers like Neural ODEs or Liquid time-constant Neural Network (LTC), there are both familiar and new principles in mathematics. I can use Google or chatGPT to understand the new principles but I am looking for books that I can learn more and dive dive into the field of SciML. Anyone can suggest such kinds of books. Thank you! submitted by /u/luciffer_ [link] [comments]
    [D] Trying to remember the name of a famous paper...
    I'm trying to find a paper a I read a while back. I believe I heard about it from this subreddit. It was old. Maybe even from the 50s or 60s. The way I remember it, it starts by discussing some general properties of entropy and then derives logistic regression as a maximum entropy model. It had sort of a physics/information theory flavor to it. At least thats how I remember it. Does that sound familiar to anyone? ​ edit: Found it thanks to /u/TastyOs. "Information theory and statistical mechanics" by ET Jaynes (1957). https://journals.aps.org/pr/abstract/10.1103/PhysRev.106.620 Non paywall version: https://bayes.wustl.edu/etj/articles/theory.1.pdf Although now I feel like "Entropy shall be all that a Man requires" would have been a much better title. submitted by /u/12tone [link] [comments]  ( 9 min )
    [D] ViT model design
    Hello everyone, I have described my understanding of the ViT architecture in the below diagram, I did not use normalization, skip connection or Multi-head Self-attention. Otherwise, if you find any mistake please let me know. Further, I have some questions. 1- In self-attention according to my understanding we pass all 256 + 1 tokens (that is basically the whole image +1 class token) to the 3 different linear layers. and find the similarity between the 2 and then do element-wise multiplication with the third to increase or decrease the magnitude, is this correct? 2- Do these linear layers in self-attention have any relationship between them or are 3 distinguish layers? 3- Further I don't understand the concept behind the class token, why we are using that if we remove that and use all 256 tokens (that will make more sense) in the classification layer is it not possible? ​ ​ https://preview.redd.it/jcmfwx1x32yb1.jpg?width=1200&format=pjpg&auto=webp&s=1d3757feb06e7576e64c5725bd460c760eadf9af submitted by /u/NoEntertainment6225 [link] [comments]  ( 9 min )
    [D] Model downloading speed is limited if not logged in to HuggingFace, help!
    When I directly click the download button on HuggingFace website repo: (Logged in) Download speed 10m/s (Not logged in) Download speed less than 500k/s I have searched over the entire internet and cannot find why. The problem is that, now I have to use git clone to download the model with command line in the linux server, however even if I use HuggingFace-cli and my token, the speed is still less than 500k/s, same as "not logged in". Does anyone have any idea? This problem is really confusing. submitted by /u/CindyIH [link] [comments]  ( 9 min )
    [R] Telling GPT-4 you're scared or under pressure improves performance
    In a recent paper, researchers have discovered that LLMs show enhanced performance when provided with prompts infused with emotional context, which they call "EmotionPrompts." These prompts incorporate sentiments of urgency or importance, such as "It's crucial that I get this right for my thesis defense," as opposed to neutral prompts like "Please provide feedback." The study's empirical evidence suggests substantial gains. This indicates a significant sensitivity of LLMs to the implied emotional stakes in a prompt: Deterministic tasks saw an 8% performance boost Generative tasks experienced a 115% improvement when benchmarked using BIG-Bench. Human evaluators further validated these findings, observing a 10.9% increase in the perceived quality of responses when EmotionPrompts were used. This enhancement is attributed to the models' capacity to detect and prioritize the heightened language patterns that imply a need for precision and care in the response. The research delineates the potential of EmotionPrompts to refine the effectiveness of AI in applications where understanding the user's intent and urgency is paramount, even though the AI does not genuinely comprehend or feel emotions. TLDR: Research shows LLMs deliver better results when prompts signal emotional urgency. This insight can be leveraged to improve AI applications by integrating EmotionPrompts into the design of user interactions. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Recent discussion on X on doing a PhD vs working in industry
    This tweet (https://x.com/sshkhr16/status/1719721507872506090?s=46) gathered a lot of discussion with each side making a lot of good points. Would be great to know opinions of folks in this community. Would be really helpful especially to people deciding what to do after their undergrad maybe. I made a post a while before but I realized the title was kinda misleading so making a new one. submitted by /u/doppler_effects [link] [comments]  ( 9 min )
  • Open

    Dreamer on classic control like CartPole
    Has anyone ever seen a simple implementation of Dreamer on classic environments like CartPole ? There are tons and tons of examples of policy gradients algorithms for CartPole and so, but I haven't seen any results of Dreamer on a simple env like CartPole : a majority of Dreamer implementations (v1,v2,v3) focuses on visual environments, a few say that they are compatible with vector-only envs but don't provide any results nor config to work with these. Typically, I'm looking for a simple implementation that doesn't uses fancy tricks (like parallel processes, etc) and works out of the box on a simple env like CartPole. Of course, I know that Dreamer isn't meant for such simple environments but for educational purposes, I think it's important to start with something as simple as possible. ​ Thank you! submitted by /u/alexandretorres_ [link] [comments]
    Multi Agent PPO in Grid World environment using PettingZoo
    I'm trying to create a simple environment using PettingZoo's parallel API: a 12x12 grid with fixed obstacles. I'm trying to train 3 agents using PPO , with the goal to cover the grid entirely. (Exploration/ Coverage task). Here's the entire colab notebook (with outputs) for the same: https://colab.research.google.com/drive/1yF4aRuQ0eZUIsboaoKZAx8JHgvHKLoof?usp=sharing Now, from what I see from the training statistics, the loss values are steadily decreasing and are converging to zero, which suggests that the model is being trained properly. However, when I evaluate the optimal policy, the results are not very good, with the average rewards of the 3 agents being 6.6, -6 and -6. Also, the environment was a pretty simple one, with just 2 obstacles. I tried testing my custom environment using PettingZoo's parallel api, and it threw an assertion error. (The last cell). The problem could be my environment is not properly formed. How can I debug this? And apart from this, what changes should I make to my PPO training loop ? Changing the architecture is an obvious option but I want to make sure all the basic stuff is correct before doing that. A simple MLP policy should work on such a simple environment. ​ submitted by /u/esem29 [link] [comments]
    CartPole equivalent enviroments in MARL?
    Hi all! I'm learning some MARL and in implementing some algorithms I'd like to test them on some simple environments. In the single agent setting people generally go to something like CartPole. Are there similar environments in MARL? For cooperative/zero-sum/general sum? submitted by /u/1cedrake [link] [comments]
    Parametrization of the Policy in Policy-based Methods
    Why is it the case that people mostly use neural networks to parameterize policies in policy-based methods, rather than probability distributions? Are there situations, where there is a stronger case to use the latter? submitted by /u/MomoSolar [link] [comments]
    "Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models", Fu et al 2023 (self-attention learns higher-order gradient descent)
    submitted by /u/gwern [link] [comments]
  • Open

    I used Blender and Stable Diffusion to make image mixes between human and AI art.
    Title says it all. Mixing these two art worlds is quite fun btw (Spoiler note: I wasn't sure if the second image counts as suggestive due to the bottom clothing, So I marked it just incase. Also, If this gets removed because of the second pic, I completely get it XD) (Watermark note: I've made my reddit account before naming myself CappyAdams/YuriMayori. These images are still created by me and I'm happy to provide proof for those who don't believe me) https://preview.redd.it/a6e1m2zkv6yb1.png?width=3840&format=png&auto=webp&s=451ca2907c2227c844dece80630daf014c1f2272 https://preview.redd.it/he9f3utlv6yb1.png?width=3160&format=png&auto=webp&s=7ac81112f10066429847a23700a3af2268ba3e8d submitted by /u/SonicaNorth [link] [comments]
    What's the cheapest (but nor free) AI chat app that can become your friend / girlfriend/ family?
    Hi! I want to use an app that costs about $6 or less per month to make friends with their different characters. I don't want to pay more. I know there's many, but they're all above $6 / month. submitted by /u/Trainer_Red99 [link] [comments]
    AI one-percenters seizing power forever is the real doomsday scenario, warns AI godfather
    submitted by /u/donutloop [link] [comments]
    AI — weekly megathread!
    News provided by aibrews.com Luma AI introduced Genie, a generative 3D foundation model in research preview. It’s free during research preview via Discord [Details]. Nous Research released Obsidian, the world's first 3B multi-modal model family pre-trained for 4 Trillion tokens that runs locally on iPhones. Obsidian competes in benchmarks withWizardLM-13B and GPT4-X-Vicuna 13B and is based on CapybaraV1.9 [Details]. Phind has released a new model Phind Model V7 that matches and exceeds GPT-4's coding abilities while running 5x faster and having16k context [Details]. Runway released an update for both text to video and image to video generation with Gen-2, bringing major improvements to both the fidelity and consistency of video results [Link]. Stability AI announced [Details]: Sta…
    Is Medicine going to turn into a job where you manage multiple AI/LLM tools to use as CoPilot
    What do you guys think is going to happen. I am a medical student and I have played around with LLMs a lot. Is medicine going to turn into this role where Doctor, Patient, LLMs (not just 1 but multiple agents) all work together for patient care? In the sense of what excel did for accountants, will LLMs do the same for doctors? Not just 1 LLM, but multiple LLM agents interfacing with each other as well working with a doctor in a symbiotic role. Doctors already spend a LOT of time in front of EHRs too. People say medicine will go back about being in person, but I feel like it will go in other direction and be EVEN more computer focused submitted by /u/derpgod123 [link] [comments]
    I made a website where you can ask the same question to GPT-2, GPT-3, ChatGPT and GPT-4, and compare the outputs
    submitted by /u/timegentlemenplease_ [link] [comments]
    Tommorow a new Chat BOT competitor is coming for a select group of people!
    submitted by /u/Unreal_777 [link] [comments]
    Entering AI era, Taiwan chip industry urges renewables push
    submitted by /u/donutloop [link] [comments]
    One-Minute Daily AI News 11/3/2023
    Google today is launching a set of generative AI product imagery tools for advertisers in the U.S. Via the new, AI-powered Product Studio, merchants and advertisers will be able to leverage text-to-image AI capabilities to create new product imagery for free, simply by typing in a prompt of the image they want to use.[1] Ilya Sutskever, the co-founder and chief scientist of OpenAI, envisions a future where humans could merge with machines, and where machines might attain human-like intelligence.[2] Instagram has been spotted developing an “AI friend” feature that users would be able to customize to their liking and then converse with, according to screenshots shared by app researcher Alessandro Paluzzi. Users would be able to chat with the AI to “answer questions, talk through any challenges, brainstorm ideas and much more,” according to screenshots of the feature.[3] Mural on Wednesday released an integration with Microsoft 365 Copilot as well as Mural AI, its native generative AI tool.[4] Sources: [1] https://techcrunch.com/2023/11/01/google-launches-generative-ai-tools-for-product-imagery-to-u-s-advertisers/ [2] https://www.adgully.com/openai-s-ilya-sutskever-unlocks-ai-s-future-138365.html [3] https://techcrunch.com/2023/11/01/instagram-spotted-developing-a-customizable-ai-friend/ [4] https://www.techtarget.com/searchunifiedcommunications/news/366558012/Mural-intros-Mural-AI-integrates-with-Microsoft-365-Copilot submitted by /u/Excellent-Target-847 [link] [comments]
    Telling GPT-4 you're scared or under pressure improves performance
    In a recent paper, researchers have discovered that LLMs show enhanced performance when provided with prompts infused with emotional context, which they call "EmotionPrompts." These prompts incorporate sentiments of urgency or importance, such as "It's crucial that I get this right for my thesis defense," as opposed to neutral prompts like "Please provide feedback." The study's empirical evidence suggests substantial gains. This indicates a significant sensitivity of LLMs to the implied emotional stakes in a prompt: Deterministic tasks saw an 8% performance boost Generative tasks experienced a 115% improvement when benchmarked using BIG-Bench. Human evaluators further validated these findings, observing a 10.9% increase in the perceived quality of responses when EmotionPrompts were used. This enhancement is attributed to the models' capacity to detect and prioritize the heightened language patterns that imply a need for precision and care in the response. The research delineates the potential of EmotionPrompts to refine the effectiveness of AI in applications where understanding the user's intent and urgency is paramount, even though the AI does not genuinely comprehend or feel emotions. TLDR: Research shows LLMs deliver better results when prompts signal emotional urgency. This insight can be leveraged to improve AI applications by integrating EmotionPrompts into the design of user interactions. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]
    Back propagation alternatives
    I understand that before back propagation was developed there were other methods used such as hebbian learning, and admittedly I know nothing about these old methods. But as I've learned about back prop in wondering is there a line of research working on alternatives? It seems amazing but also so highly incremental and blind that I wonder if there's a better way. One of it's major drawbacks is the fact that the information must pass through the entire structure rather than getting immediate feedback. Anyway, thanks! submitted by /u/Stack3 [link] [comments]
  • Open

    Fax machines in the 21st century
    There are still tens of millions of fax machines still exist. My business line gets calls from modems and fax machines fairly often. Maybe my number is close to that of a fax machine. Fax machines and health care Fax machines are especially common in health care. I remember when I was working at MD […] Fax machines in the 21st century first appeared on John D. Cook.  ( 6 min )
    Blog RSS feed
    I got an email from someone saying the RSS feed for this site stopped working. Anyone else having this problem? I subscribe to my RSS feed and it’s working fine for me. It may be that there are variations on the RSS feed, and the version I’m using works while the variation some others use […] Blog RSS feed first appeared on John D. Cook.  ( 5 min )
    Solitons and the KdV equation
    Rarely does a nonlinear differential equation, especially a nonlinear partial differential equation, have a closed-form solution. But that is the case for the Korteweg–De Vries equation. (Technically I should say it’s rare for a naturally-occurring nonlinear differential equation to have a closed-form solution. You can always start with a solution and cook up a contrived […] Solitons and the KdV equation first appeared on John D. Cook.  ( 6 min )
  • Open

    Best of both worlds: Achieving scalability and quality in text clustering
    Posted by Sara Ahmadian and Mehran Kazemi, Research Scientists, Google Research Clustering is a fundamental, ubiquitous problem in data mining and unsupervised machine learning, where the goal is to group together similar items. The standard forms of clustering are metric clustering and graph clustering. In metric clustering, a given metric space defines distances between data points, which are grouped together based on their separation. In graph clustering, a given graph connects similar data points through edges, and the clustering process groups data points together based on the connections between them. Both clustering forms are particularly useful for large corpora where class labels can’t be defined. Examples of such corpora are the ever-growing digital text collections of variou…  ( 92 min )
  • Open

    From knowledge graph discussion group to think tank
    For several years now, I’ve helped to organize and moderate a discussion forum focused on knowledge graphs, knowledge modeling and related topics. The group began as an effort to help each other build personal knowledge graphs.  Two of our members from that effort – George Anadiotis and Ivo Velichkov – compiled and edited a multi-author… Read More »From knowledge graph discussion group to think tank The post From knowledge graph discussion group to think tank appeared first on Data Science Central.  ( 21 min )
  • Open

    ‘Starship for the Mind’: University of Florida Opens Malachowsky Hall, an Epicenter for AI and Data Science
    Embodying the convergence of AI and academia, the University of Florida Friday inaugurated the Malachowsky Hall for Data Science & Information Technology. The sleek, seven-story building is poised to play a pivotal role in UF’s ongoing efforts to harness the transformative power of AI, reaffirming its stature as one of the nation’s leading public universities. Read article >  ( 6 min )
    ‘Starship for the Mind’: University of Florida Opens Malachowsky Hall, an Epicenter for AI and Data Science
    Embodying the convergence of AI and academia, the University of Florida Friday inaugurated the Malachowsky Hall for Data Science & Information Technology. The sleek, seven-story building is poised to play a pivotal role in UF’s ongoing efforts to harness the transformative power of AI, reaffirming its stature as one of the nation’s leading public universities. Read article >  ( 6 min )
    How AI-Based Cybersecurity Strengthens Business Resilience
    The world’s 5 billion internet users and nearly 54 billion devices generate 3.4 petabytes of data per second, according to IDC. As digitalization accelerates, enterprise IT teams are under greater pressure to identify and block incoming cyber threats to ensure business operations and services are not interrupted — and AI-based cybersecurity provides a reliable way Read article >  ( 11 min )
    How AI-Based Cybersecurity Strengthens Business Resilience
    The world’s 5 billion internet users and nearly 54 billion devices generate 3.4 petabytes of data per second, according to IDC. As digitalization accelerates, enterprise IT teams are under greater pressure to identify and block incoming cyber threats to ensure business operations and services are not interrupted — and AI-based cybersecurity provides a reliable way Read article >  ( 11 min )
  • Open

    Bundesliga Match Facts Shot Speed – Who fires the hardest shots in the Bundesliga?
    There’s a kind of magic that surrounds a soccer shot so powerful, it leaves spectators, players, and even commentators in a momentary state of awe. Think back to a moment when the sheer force of a strike left an entire Bundesliga stadium buzzing with energy. What exactly captures our imagination with such intensity? While there […]  ( 10 min )

  • Open

    What are some of the coolest AI use cases you've tested so far? AI tutoring is something that I actually found myself using almost daily
    submitted by /u/Playdonifps [link] [comments]
    how can we be sure AI won't rebel against humans in the future?
    basically the title, how can we be sure AI won't have self awareness and won't rebel against humans? submitted by /u/lilshoegazecat [link] [comments]
    Same Images In Runway Gen 2 From 3 Months Ago VS Now (default options)
    submitted by /u/SuspiciousPillbox [link] [comments]
    Teen boys use AI to make fake nudes of classmates, sparking police probe
    Teen boys at Westfield High School in New Jersey used AI image generators to create and share fake nude photos of female classmates, sparking a police investigation. The school believed the images had been deleted, but it remains unclear how many students were affected or if any disciplinary action was taken. There is currently no federal law restricting the creation of faked sexual images, but some states have passed laws to outlaw the distribution of faked porn. President Joe Biden has issued an executive order urging lawmakers to pass protections against generative AI producing child sexual abuse material. New Jersey may strengthen its laws to criminalize the creation and sharing of AI-faked nudes. Source : https://arstechnica.com/tech-policy/2023/11/deepfake-nudes-of-high-schoolers-spark-police-probe-in-nj/ submitted by /u/NuseAI [link] [comments]
    text to 3D in 10 seconds (Mickey, Miney, Bart, Tom Holland, Cristiano Ronaldo?) (workflow in comments)
    submitted by /u/PeePeePeePooPooPooo [link] [comments]
    Benchmarking machine learning frameworks
    submitted by /u/mfilion [link] [comments]
    We don't want it ~
    submitted by /u/Unreal_777 [link] [comments]
    LinkedIn just launched a new AI job coach for Premium members
    submitted by /u/thisisinsider [link] [comments]
    Harmonizing the Future: AI, Music, and Crypto Revolution
    submitted by /u/Einsof__ [link] [comments]
    Searching for an internship in AI for my bachelor thesis
    I am a Belgian student currently studying applied informatics with a specialization in AI. We learn everything from machine learning to generative AI, and have a focus on integrating these into actual solutions. Next semester I am required to do an internship from March 25th till June 19th. During this internship I am also required to work on and write my bachelor thesis. The main problem now is that there are very little companies that have contacted the school with internship positions related to AI. So I came here in the hopes that some of you may know companies that are willing to offer an internship position. Either in Belgium or an international company offering remote work. My preference goes out to something in research or innovative, but I am open to do any AI related work. If it is something I have little experience in I will learn. I will continue to search myself, but thanks in advance for any help! submitted by /u/ETS_Green [link] [comments]
    How can I generate accurate words and sentences in Midjourney?
    I’m currently using the pro version, though I’m extremely new to using it and have seen where you can add modifiers? Has anyone had any success with creating sentences or typography? If so do you have a method you use? submitted by /u/Maelasae [link] [comments]
    New Order - Blue Monday (AI music visualization)
    submitted by /u/glenniszen [link] [comments]
    Could Socratic Dialogue Evolve into a Hacking Technique for AI Systems?
    submitted by /u/utku1337 [link] [comments]
    Combination of each Star Wars trooper's helmet, from the Old Republic to the Final Order, into one
    submitted by /u/MomusVult [link] [comments]
    Convinced yet?
    submitted by /u/Philipp [link] [comments]
    What did humans lose by gaining intelligence?
    What did humans lose by gaining intelligence? submitted by /u/Virtual-Study-Campus [link] [comments]
    The art of color
    submitted by /u/Sea_Permit5660 [link] [comments]
    What is your approach to continuous testing and integration?
    If your answer is not below the given options, you can share in the comment section. I would appreciate your answers and suggestions. View Poll submitted by /u/Cygnet-Digital [link] [comments]
    Role Of AI In Business: Benefits And Challenges .
    AI is not a far-off pipe dream anymore; it has already become a precious resource for companies, helping them save time and reduce costs. However, there is still widespread confusion about how to effectively use Artificial Intelligence in businesses. Many are contemplating how to harness this technology for innovation, scalability, and improvement. If you are among those thinking this, then there’s no reason to look further because, in this piece, you are going to learn about it. Ready to dig deeper into these AI trends and understand how they’ll shape your industry in 2024? Dive into our blog for comprehensive insights. 👉 https://invozone.com/blog/ai-in-business/ submitted by /u/InvoZone [link] [comments]
    AI on weight gain from the 1950s to today
    Here's a progression timeline of the obesity epidemic, with a focus on quantifying weight gain: 1950s-1960s: - Initial Changes: During this period, the average American adult gained approximately 10 pounds compared to their counterparts from the early 1900s. - TV's Sedentary Effect: Hours of TV watching correlated with a slight uptick in average body weight. 1970s: - Fast Food's Caloric Boom: Regular consumption added an estimated 200-300 extra calories per day to many individuals' diets, leading to potential weight gains of 20-30 pounds a year if not offset by exercise. - Shift in Work: The move to sedentary jobs meant many adults were burning 100-200 fewer calories per day, leading to an additional potential weight gain of 10-20 pounds a year. 1980s: - Processed Food Surge: The averag…
    What do you guys expect from the OpenAI developer conference on November 6 ?
    I would guess some API access stuff, nothing more. submitted by /u/Mission-Length7704 [link] [comments]
    One-Minute Daily AI News 11/2/2023
    Shopify (SHOP.TO) has to prove to investors that its AI products will spark growth when it reports results on Thursday. Wall Street is expecting it to show revenue growth of 22.38% to $1.67 billion compared to last year, according to estimates from LSEG.[1] At a U.K. summit, 28 governments, including China and the U.S., signed a declaration agreeing to cooperate on evaluating the risks of artificial intelligence.[2] AMD’s MI300 Chips Projected to Make $1 Billion in Sales, Challenging Nvidia’s Dominance.[3] Scarlett Johansson demands AI app stop using her likeness in an ad without her permission.[4] Sources: [1] https://www.reuters.com/business/retail-consumer/shopify-merchants-seek-ai-boost-key-sales-decisions-2023-11-01/ [2] https://www.nytimes.com/2023/11/01/world/europe/uk-ai-summit-sunak.html [3] https://gameishard.gg/news/amd-rises-as-ai-chip-sales-prediction-bodes-well-for-rivalry-with-nvidia-by-reuters/535905/ [4] https://www.nbcnews.com/tech/scarlett-johansson-legal-action-ai-app-rcna123248 submitted by /u/Excellent-Target-847 [link] [comments]
    text to 3D
    submitted by /u/PeePeePeePooPooPooo [link] [comments]
  • Open

    UK AI Safety Summit 2023: Scaling up the Future
    submitted by /u/engaged_ape [link] [comments]
  • Open

    Zero-shot adaptive prompting of large language models
    Posted by Xingchen Wan, Student Researcher, and Ruoxi Sun, Research Scientist, Cloud AI Team Recent advances in large language models (LLMs) are very promising as reflected in their capability for general problem-solving in few-shot and zero-shot setups, even without explicit training on these tasks. This is impressive because in the few-shot setup, LLMs are presented with only a few question-answer demonstrations prior to being given a test question. Even more challenging is the zero-shot setup, where the LLM is directly prompted with the test question only. Even though the few-shot setup has dramatically reduced the amount of data required to adapt a model for a specific use-case, there are still cases where generating sample prompts can be challenging. For example, handcrafting…  ( 93 min )
  • Open

    [R] Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models
    Paper: https://arxiv.org/abs/2310.17086 Abstract: Transformers are remarkably good at in-context learning (ICL) -- learning from demonstrations without parameter updates -- but how they perform ICL remains a mystery. Recent work suggests that Transformers may learn in-context by internally running Gradient Descent, a first-order optimization method. In this paper, we instead demonstrate that Transformers learn to implement higher-order optimization methods to perform ICL. Focusing on in-context linear regression, we show that Transformers learn to implement an algorithm very similar to Iterative Newton's Method, a higher-order optimization method, rather than Gradient Descent. Empirically, we show that predictions from successive Transformer layers closely match different iterations of Newton's Method linearly, with each middle layer roughly computing 3 iterations. In contrast, exponentially more Gradient Descent steps are needed to match an additional Transformers layer; this suggests that Transformers have an comparable rate of convergence with high-order methods such as Iterative Newton, which are exponentially faster than Gradient Descent. We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, we show theoretical results which support our empirical findings and have a close correspondence with them: we prove that Transformers can implement k iterations of Newton's method with O(k) layers. ​ https://preview.redd.it/i6hdcx1v60yb1.jpg?width=2036&format=pjpg&auto=webp&s=ed95fc0b625878ed88c3f36baa9ea3fb07430ff7 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [P] Get alerts when your AI fails (and get a gift card too!)
    Hey folks — I’m working on a platform that allows you to set up meaningful, automated tests and experiment tracking for your LLMs in just a few minutes (our tests go well beyond just the usual latency, token usage & cost stuff). If you’ve tried shipping LLMs and haven’t run into any issues with performance or trustworthiness, leave a comment on what your use-case is / how you did it! If not, here’s a free sign-up link to our app: https://app.openlayer.com. P.S. Will send a $50 amazon gift card your way if you’re interested in giving me additional feedback afterwards (30 min call). Just send me an email at [gabriel@openlayer.com](mailto:gabriel@openlayer.com). submitted by /u/byebaybay [link] [comments]  ( 9 min )
    [P] Benchmarking machine learning frameworks
    MLBench enables developers and maintainers to effortlessly gauge how their frameworks perform compared to other implementations, prior code versions, or across different boards, with respect to both runtime performance and other metrics. https://www.collabora.com/news-and-blog/news-and-events/benchmarking-machine-learning-frameworks.html submitted by /u/mfilion [link] [comments]  ( 9 min )
    [D] One LLM won't rule them all
    There isn't going to be one LLM to rule them all and here's why: https://generatingconversation.substack.com/p/one-llm-wont-rule-them-all submitted by /u/cgwuaqueduct [link] [comments]  ( 9 min )
    [D] Can somebody share their experience of attending ICML conference?
    I'm planning to attend ICML 2024 in person. Can somebody share their experience of attending the conference? Is it worth attending if you don't have any paper to present? If yes, how to get the most out of it? submitted by /u/cpluscplus [link] [comments]  ( 9 min )
    [D] Could it be advantageous to train/finetune some model (actually a LLM) feature by feature?
    This may be weird question, but I have tried looking for resources and haven't found anything at all. We are trying to train a classifier from some data using a pretrained LLM. The data consist of several features which are text, so we decided at some point to concatenate them in a string with the proper connectors and use as input the whole string. For instance, think about data of some product: "name", "manufacturer", "specs", etc. Then we create a string as "The product televisor from this manufacturer which have the following specs: ...". In this case the problem would be to decide the category of the product. For our particular case, our model makes some critical mistakes and it seems that they stem from not noticing that some features are critical (following the previous example, the manufacturer for instance). We were wondering if it would be beneficial to train the model first only using the substring corresponding to the feature that we think is more important and add features to the training little by little. I am a bit worried that this may lead to a bad local minimum and the model gets stuck there. Has anybody seen or done anything similar or has any reason why this would or wouldn't work? submitted by /u/soloetc [link] [comments]  ( 9 min )
    [R] GRACE: Discriminator-Guided Chain-of-Thought Reasoning
    TLDR: The paper proposes a decoding approach that improves multi-step (chain-of-thought) reasoning by using a discriminator to score and guide the generation of correct reasoning steps. It outperforms self-consistency and verifiers on various tasks and enhances both final answer accuracy and intermediate reasoning correctness. ​ Paper: arxiv.org/abs/2305.14934 Code: https://github.com/mukhal/grace/ ​ ​ submitted by /u/moyle [link] [comments]  ( 9 min )
    [D] From i5 10400 to i5 11400 or Another monitor for dual monitor set up
    My current CPU is i5 10400. For ML will i5 10400 bottleneck rtx 3060 12gb? heard it will because that i5 supports only 3rd Gen, then should I upgrade to i5 11400 just to get the support of Gen 4 or instead I should just buy another monitor for dual monitor set up? submitted by /u/speed-speed [link] [comments]  ( 9 min )
    [D] benefits of using only attention weights for LoRA
    I'm confused as to why people would only use a subset of weights as learnable parameters for LoRA. If you are only using attention weights as update params, you still need the decomposed weights for the other layers to get the derivative of the loss with respect to the attention weights. That's how the chain rule works, so I don't see how it would help with memory consumption. Is there something I'm missing here? submitted by /u/skelly0311 [link] [comments]  ( 9 min )
    [D] Unsupervised feature selection
    Hello, I am doing a simulation study to see how associations change when aggregating spatial data. I have a large number of continuous exposure variables to choose from (>100). Before creating my models, I want to choose a group of meaningful exposures that limit collinearity (maybe ~10). The outcome variable will be simulated, and I want to choose the exposures without regard to the outcome (therefore I believe this is unsupervised learning). What is the best way to select these features? How many features should I select? Thank you! submitted by /u/DefinitelyAmNotOP [link] [comments]  ( 9 min )
    [D] Thoughts on Masked Language Modeling Objective and Corrupted Spans for Causal LM's?
    Google has published a number of papers showing their increasing affinity for what was originally called a masked language modeling objective and has now become corrupted spans. I first saw this in the T5 paper Later in the UL2 paper "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?" And finally in Transcending Scaling Laws with 0.1% Extra Compute However, I don't see this mentioned in this sub or in any really in any discussions on LLM training, nor do I find documentation for it in any LLM training frameworks. Is this something any of you have used? Are you aware of any open source tools or code examples of this? One thing that confuses me is that these papers all discuss Prefix-LMs as being important in this process, though I'm unsure what they mean. I know a Prefix-LM has a portion of it's inputs with bi-directional attention, but it's not something any actual models do as far as I can tell. But, in the "Transcending Scaling Laws" paper it seems like they take PaLM training checkpoints and finish off the pre-training with their corrupted span method by converting PaLM to a Prefix-LM. They say: We train U-PaLM using the prefix language model (PrefixLM) architecture, also sometimes known as a non-causal decoder-only model. The PrefixLM architecture keeps a non-causal mask in its prefix (or inputs) and applies bidirectional attention to input tokens. Is it really possible to take a causal LM and turn it into a prefix LM (partially bi-directional) by just changing the attention mask with no impact on the learned weights? Or do they mean they are training on completions only as described in the hugging face documentation submitted by /u/elbiot [link] [comments]  ( 9 min )
    [D] Imputing the Testing data after combining the Training and Validation datasets
    Hello, I imputed my training data's missing values with the mean of each column, but my question is, after combining the training and validation datasets and re-training the model, when we go on to test the model on the testing dataset, should we impute the testing data's missing values with the old training data's mean values, or the newly combined (train and validation) dataset's means? submitted by /u/CrunchyMind [link] [comments]  ( 9 min )
    [Research] Detecting Annotation Errors in Semantic Segmentation Data
    Would you trust medical AI that’s been trained on pathology/radiology images where tumors/injuries were overlooked by data annotators or otherwise mislabeled? Most image segmentation datasets today contain tons of errors because it is painstaking to annotate every pixel. Example of bone shard not labeled properly. After substantial research, I'm excited to introduce support for segmentation in cleanlab to automatically catch annotation errors in image segmentation datasets, before they harm your models! Quickly use this new addition to detect bad data and fix it before training/evaluating your segmentation models. This is the easiest way to increase the reliability of your data & AI! We have feely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like. submitted by /u/cmauck10 [link] [comments]  ( 9 min )
    [D] AAAI 2024 Reviews
    Reviews are out on CMT. What did you get? submitted by /u/TheTeoz [link] [comments]  ( 8 min )
    [R] help with RFE in R
    Hi everyone, I’m working on a bioinformatics project for school and I’m having some trouble doing the RFE in R. I’ve got an affymetrix gene expression set that I’ve done preprocessing on and I’m at the point where I’m ready to do RFE but I’m struggling writing the code for this (my first iteration took 30 hours to run and I screwed it up still… gave me my sample names as top variables instead of my probe sets). With that being said, the goal is to mine a bio marker from the affymetrix data through SVM&RF RFE. I’d also like to investigate different bio marker sizes (2 probe sets, 50, 100, all the way up to the max number of probes I have) I’m a biochemist by education got thrown into R and am struggling to learn it. I will say, it has been kinda fun tho. Thank you in advance, if any of you are in the central CT, USA region I’ll buy ya a beer! — rusty submitted by /u/RustyShackleford2677 [link] [comments]  ( 9 min )
    [D] A question about Nvida Nemo + Getty Images
    Getty is pitching their new Generative AI tool in our company, it's based on Nvidia Nemo and their "unique selling point" is that the model is fully trained with Getty images. This sounds a bit odd to me, but I'm not sure if it is actually possible to have a model only trained with proprietary images. The legal team assumes this as a true statement but I still think this is some sort of finetuned version of NVIDIA Nemo model. Does any of you have any clue on where to look to make sure we are not being a bit too gullible. Thank you in advance. submitted by /u/legado [link] [comments]  ( 9 min )
    [D] Open Source Machine Learning Projects
    Hey, So i consider myself as a beginner in ML and i was wondering if there are open source projects i could contribute to, to apply my knowledge and get a real world experience of how these things are built an. The only projects I could find are labs or libraries and not products used in real life. Does any one have any experience contributing to open source projects? submitted by /u/parvpareek [link] [comments]  ( 9 min )
    [R] Facebook Research archived hand-related repos
    I find that recently, Facebook Research archived (at lease some) their repos about hand-related research. Even very recent research: https://github.com/facebookresearch/PressureVision Or very popular ones: https://github.com/facebookresearch/ContactPose Does anybody have any idea why this happend? Or is it a mistake? submitted by /u/Crow-Scare [link] [comments]  ( 9 min )
    [D] Tools for creating ASR datasets
    [D] please can anyone recommend tools for creating ASR datasets? submitted by /u/afrodata [link] [comments]  ( 9 min )
    Separate model for outliers [R] [outliers] [regression]
    Hey Here is a situtation i found myself. There is a dataset with around 4K observations with near 10% outliers in target variable. Transformations like log, box-cox, winsorize didn't work out. Robust regression approaches didn't help either.Model performance metrics with and wo those outliers are 4x worse. Note Just removing those observations is not considered Here is my plan. Build another model aimed to detect the outliers first, model for "regular" observations and model for outliers. Does it make sense and i'm not overcomplicating? What is the common approach in such cases? Any help and ideas are highly apreciated. Thanks in advance submitted by /u/No_Purchase8883 [link] [comments]  ( 9 min )
    [D] AAAI 24 Reviews
    Creating this thread in anticipation of the upcoming Phase 2 reviews and results. If there already is a thread, please share! ​ submitted by /u/tallguyfromstats [link] [comments]
    [D] Career Path for pursuing job in Machine Learning and Operations Research
    I am a final-year Industrial Engineering Masters student at Purdue University. My major areas of interest are Machine Learning and Operations Research. I have an option to pursue a Dual Master with other Masters in Electrical and Computer Engineering. Should I choose that path to get better job opportunities ? Will the other degree help me in the long term? [D] submitted by /u/pulkit_mundra [link] [comments]  ( 9 min )
  • Open

    2023-24 Takeda Fellows: Advancing research at the intersection of AI and health
    Thirteen new graduate student fellows will pursue exciting new paths of knowledge and discovery.  ( 14 min )
    Generating opportunities with generative AI
    Rama Ramakrishnan helps companies explore the promises and perils of large language models and other transformative AI technologies.  ( 10 min )
  • Open

    Deploy ML models built in Amazon SageMaker Canvas to Amazon SageMaker real-time endpoints
    Amazon SageMaker Canvas now supports deploying machine learning (ML) models to real-time inferencing endpoints, allowing you take your ML models to production and drive action based on ML-powered insights. SageMaker Canvas is a no-code workspace that enables analysts and citizen data scientists to generate accurate ML predictions for their business needs. Until now, SageMaker Canvas […]  ( 6 min )
    Develop generative AI applications to improve teaching and learning experiences
    Recently, teachers and institutions have looked for different ways to incorporate artificial intelligence (AI) into their curriculums, whether it be teaching about machine learning (ML) or incorporating it into creating lesson plans, grading, or other educational applications. Generative AI models, in particular large language models (LLMs), have dramatically sped up AI’s impact on education. Generative […]  ( 8 min )
  • Open

    How Are Foundation Models Used in Gaming?
    AI technologies are having a massive impact across industries, including media and entertainment, automotive, customer service and more.  ( 8 min )
    How Are Foundation Models Used in Gaming?
    AI technologies are having a massive impact across industries, including media and entertainment, automotive, customer service and more.  ( 8 min )
    GeForce NOW-vember Brings Over 50 New Games to Stream In the Cloud
    Gear up with gratitude for more gaming time. GeForce NOW brings members a cornucopia of 15 newly supported games to the cloud this week. That’s just the start — there are a total of 54 titles coming in the month of November. Members can also join thousands of esports fans in the cloud with the Read article >  ( 8 min )
    GeForce NOW-vember Brings Over 50 New Games to Stream In the Cloud
    Gear up with gratitude for more gaming time. GeForce NOW brings members a cornucopia of 15 newly supported games to the cloud this week. That’s just the start — there are a total of 54 titles coming in the month of November. Members can also join thousands of esports fans in the cloud with the Read article >  ( 8 min )
  • Open

    What architecture for vision-based RL?
    Hello dear community, Someone has just asked me this question and I have been unable to provide a satisfactory answer, as in practice I have been using very simple and quite naive CNNs for this setting thus far. I think I read a couple papers a while back that were advocating for specific types of NNs to deal with vision-based RL specifically, but I forgot. So, my question is: what are the most promising NN architectures for pure vision-based (end-to-end) RL according to you? Thanks :) submitted by /u/yannbouteiller [link] [comments]
    Comparing RL vs LLM Prompting for Game Playing AI
    Hey people, I wanted to share a video from my ML YouTube channel discussing the state of the art methods for game playing AI systems post the LLM boom. Of course, this space was dominated by Reinforcement Learning for most of the 2010s, but there has been some interesting work towards using LLMs solo or as an “RL assistant” to train better RL agents. Some of the papers I talked about in the video seem to indicate that LLMs can guide RL exploration at the start of training to drastically improve sample efficiency. Here’s my video breaking down the complex prompting systems that let LLMs like GPT4 play Minecraft-like open world games and reflect on their progress. Hope people who are interested find it worthwhile… submitted by /u/AvvYaa [link] [comments]
  • Open

    A disk around Paris
    The other day I saw an image of a large disk centered on Paris subjected to the Mercator projection. I was playing around in Mathematica and made similar images for different projections. Each image below is a disk of radius 4200 km centered on Paris (latitude 49°, longitude 2°). All images were produced with the […] A disk around Paris first appeared on John D. Cook.  ( 5 min )
  • Open

    A Video Game that Pays: Lessons Learned from Working Remotely
    The original tweet by Sahil Lavingia Some time ago, I stumbled upon this amusing tweet from @shl. I love how true this statement is, although it feels very wrong to admit it. Let’s think about the day in the life of a software engineer: You interact with people from all corners of the world. You complete tasks on different online platforms (Slack, GitHub, VS Code, Google Docs, etc.). You have a list of your main quests to complete (you can find them in your journ… Kanban board!) There are also side-quests to take care of (“Hey Damian, could you take a look at this bug?”). Sometimes you can gang up with your teammates to slay a big beast (“Hey, I have this nasty bug that I have been working on for the past few days. I know you have better knowledge of this particular part of the repositor…  ( 8 min )

  • Open

    [D] Self Attention from First Principles (A video)
    Hello guys, I wanted to post a video I have been working on for my Deep Learning YT channel about Self Attention and Masked Self Attention. In the video, I tried to explain the essence of self-attention in an intuitive manner, describe how it works in practice, why it works how it works, its various strengths and applications… I’m kinda excited to share the video here for those that are interested. https://youtu.be/4naXLhVfeho submitted by /u/AvvYaa [link] [comments]  ( 9 min )
    [R] Zephyr: Direct Distillation of LM Alignment - state-of-the-art for 7B parameter chat models
    Paper: https://arxiv.org/abs/2310.16944 GitHub: https://github.com/huggingface/alignment-handbook Hugging Face: https://huggingface.co/collections/HuggingFaceH4/zephyr-7b-6538c6d6d5ddd1cbb1744a66 X thread: https://twitter.com/Thom_Wolf/status/1717821614467739796 Abstract: We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at this https URL. ​ https://preview.redd.it/4y355lxv0txb1.jpg?width=1200&format=pjpg&auto=webp&s=76e7b8a2ff06e39e9189712a42b1e349423b5d3d ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] how to understand structural equivalence discovered by Node2Vec?
    Hi, I am new to graph neural networks and have some issue understanding structural equivalence discovered by Node2Vec. For instance, given following plot generated by Node2Vec visualization (taken from https://towardsdatascience.com/complete-guide-to-understanding-node2vec-algorithm-4e9a35e5d147): ​ https://preview.redd.it/kqjiiywelsxb1.png?width=500&format=png&auto=webp&s=6ad9d86f6f0a6dbe1a5d641272a5956d110c8770 They claim that if we encourage BFS search in Node2Vec, then we are more likely to discover structural equivalence patterns in the bottom plot. What I do not quite get is if we encourage BFS, would discovered embeddings be more "local"? If so, why those distant nodes would have similar colors/embeddings? Any idea would be much appreciated, thanks! submitted by /u/Illustrious-Pay-7516 [link] [comments]  ( 9 min )
    [D] Need Help About Height Map Analysis
    As mentiones in the title, I need help with height map analysis. I want to create an artificial intelligence that can analyze the height map given as an input and mark areas such as mountainous areas, river beds, plains, rocks, cliffs and other geographical details on the map. Is there any advice you can give me or could you provide suggestions which will help me to move forward? I really am insterested in this project and want to work on it. Also, if you know any examples and/or studies related to this subject, can you please share them? I am looking forward to discovering more information regarding this topic. Edit: Spelling errors submitted by /u/PlayerWell [link] [comments]  ( 9 min )
    [D]ai that can remove emojis/spiderman filter from face
    recently i saw an ai where it said it can remove emoji which was obviously fake but it made me curious is there an ai that can do this? because i think ai cant do it is because the face does not exist in the image so ai have no way of showing the face behind emoji submitted by /u/randomaccimade69 [link] [comments]  ( 9 min )
    [D] With LLMs hallucinating nature, how do we create a credible production ready application?
    I want to use LLMs to automate analysing data and use it to provide insights to my users, but often times I notice insights being generated on factually incorrect data. I tried fine tuning my prompts, the structure in which I pass data to LLM, few shot learning but there still some chance of it to hallucinate. How can I create a production ready application where this insights are surfaced to end users and presenting incorrect insights is not accepted? I am out of ideas. Any guidance is appreciated 🙏🏻 submitted by /u/software-n-erd [link] [comments]  ( 9 min )
    [D] professionally, is data collection carried within a notebook?
    I am building my first model and I am about to start the data collection process, building a csv dataset, but I do not know whether to do this within a python script, the jupyter notebook where i will write the model, or a separate notebook. I have tried researching this but I have not found a concise answer to my question. Thanks in advance. submitted by /u/obvslynot [link] [comments]  ( 9 min )
    [D] What do y'all think about Biden's new AI regulation?
    Would it stifle the opensource development and new AI startups and only benefit the established big tech companies? It's kinda vague and I couldn't understand in my first read but does the executive order lacks teeth? That is if the guideline is not followed closely can government do anything to opensource community or new startups? limk to one of articles (no paywall) : https://www.reuters.com/technology/white-house-unveils-wide-ranging-action-mitigate-ai-risks-2023-10-30/ submitted by /u/ColumbiaGSAlum [link] [comments]  ( 9 min )
    [P] LLM-VM: Fine-tune anywhere & avoid big models.
    I wanted to share a project I’ve been working on, LLM-VM. It’s a community-first, open-source tool designed to enhance the efficiency of fine-tuning and inference for large language models (LLMs) both locally and in cloud environments. At its core, LLM-VM implements recursive synthesized distillation with automatic task discovery. This means it can iteratively refine training data and model parameters, aiming to optimize model performance with less computational overhead. Our goal with LLM-VM is to provide a practical and accessible platform for researchers and developers. By facilitating more efficient model training and deployment, we hope to contribute to the broader machine learning community’s efforts in advancing language model capabilities. I’d love to get your feedback, contributions, or any thoughts you might have. Let’s collaborate to push the boundaries of what we can achieve with LLMs! Cheers! submitted by /u/mmirman [link] [comments]  ( 9 min )
    [P] LangCheck: a multi-lingual toolkit to evaluate LLM applications
    Hi! I wanted to share LangCheck, an open source toolkit to evaluate LLM applications (GitHub, Quickstart). It already supports English and Japanese text, and more languages soon – contributions welcome! Core functionality: langcheck.metrics – metrics to evaluate quality & structure of LLM-generated text langcheck.plot – interactive visualizations of text quality langcheck.augment – text augmentations to perturb prompts, references, etc (coming soon) Super open to feedback & curious how other people think about evaluation for LLM apps. submitted by /u/kennysong [link] [comments]  ( 9 min )
    [D] Data Pipelines for Data Products
    Data pipelines are one of the key components of an ML product. Creating value from different resources only makes sense when it is available to the consumers. In the article, you will explore the most important elements of a data pipeline that fulfils the data product needs, and you will get practical guidelines to incorporate in your use cases. Here's the article: https://moderndata101.substack.com/p/data-pipelines-for-data-products submitted by /u/growth_man [link] [comments]  ( 9 min )
    [Research], [R], [Project], [P] Dataset for evaluate algorithms on sensor calibration
    Hi everyone. I should evaluate the precision and accuracy of some machine learning algorithms to calibrate a sensor. In particular I should compare these algorithms and choose the best one to calibrate a sensor, to proceed to obtain a transfer function. I did various searches on known sites and repositories but found very little. In particular, I would need datasets from 3 different sensors, in order to test the various algorithms on different sensors. Therefore each dataset must have the data collected by the sensor and the target data of the sensor itself, so that it can be calibrated and to be able to evaluate the calibration algorithms. Can you help me? submitted by /u/Calosss22 [link] [comments]  ( 9 min )
    [N] Webinar - Enable and manage Vector Search in MongoDB Atlas with SuperDuperDB
    Hi Community! Today we are hosting a hands-on "Enable and manage Vector Search in MongoDB Atlas with SuperDuperDB Webinar": https://www.eventbrite.com/e/enable-and-manage-vector-search-in-mongodb-atlas-with-superduperdb-webinar-tickets-744936223297 The following questions will be answered in the workshop: · What is vector search and why is it so important? · What are vector databases? · Why is it a huge advantage to use vector search with MongoDB Atlas instead of a vector database? · What embedding models are there? · How do I use these models to generate vector embeddings for my data? · How do I perform vector search? · What AI applications can I build on top of vector search? When? Wednesday, November 1st 12pm - 1pm ET (Eastern Standard Time) Add to Google · Outlook · iCal · Yahoo submitted by /u/Sevyten [link] [comments]  ( 9 min )
    [D] Rewrite with many style using AI
    Hi everyone. Recently I have done some research to build a writing AI tool such as Quillbot. I see them do really amazing in paraphrase. I am currently trying to learn and research some articles to create a Deep learning model that can change many different writing styles and allow users to customize it themselves. I tried starting with small models and a few pretrained models. I find that models are trained to specialize in only one writing style or the models simply rewrite without regard to the specific style. Next, I tried LLM models like mistral 7B or Llama-2 and using the results, I self-assessed them to be somewhat better. However, I want more than that. Is there a way to create many different writing styles using only one model and we can scale it with more styles? and whether anyone has applied it to real products. If you can, could you suggest me some more related research? submitted by /u/HughLee_1999 [link] [comments]  ( 9 min )
    [R] LLMs may Dominate Information Access: Neural Retrievers are Biased Towards LLM-Generated Texts
    [arXiv] https://arxiv.org/abs/2310.20501 [Abstract] Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search. With their remarkable capabilities in generating human-like texts, LLMs have created enormous texts on the Internet. As a result, IR systems in the LLMs era are facing a new challenge: the indexed documents now are not only written by human beings but also automatically generated by the LLMs. How these LLM-generated documents influence the IR systems is a pressing and still unexplored question. In this work, we conduct a quantitative evaluation of different IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that …
    [P] Numpy / Numba implementation of IVF/PQ ANN index which is as fast as faiss
    Hi, just sharing with my recent github project fast-ivf in which I implemented from scratch IVF (+PQ) index using purely numpy/numba libraries. I did it mostly for an educational purpose. I also implemented something which I called `CompressedFastIVF` index which trains auto shallow autoencoder using kmeans assignments to reduce the dimensionality of the source embeddings, which seems to be a nice alternative for PQ method, at least on my data. Surprisingly, when I compared my implementation with the faiss library, I got 10 times speed up on my custom dataset with about 900k vectors of size 1024 when using simple IVF index. There are probably multiple reasons for that e.g. I use numpy with mkl library for algebraic operations, different implementation for kmeans (which result in different clusters sizes distribution) etc. When I tested it on other publicly available datasets, the speedup is much slower, which showed me that we should always test ANN libraries on the target data. ​ ​ ​ submitted by /u/kmkolasinski [link] [comments]  ( 9 min )
    [D] Machine Learning in Health
    I would like to know if someone currently works or has worked as a machine learning engineer in the field of medical science / health and if so, I would like to know about their experiences. The background is that I got the possibility to either work in the medical field or robotics and I can't really decide and thus looking for some input. I am most curious about what you did in your work and if it felt fun / rewarding. Thanks a lot! submitted by /u/Numerous_Talk7940 [link] [comments]  ( 9 min )
    [D] [P] ML Forecasting of Satellite Imagery and Handling Missing Data (Clouds)? [P]
    Currently working on a personal project using satellite imagery to identify and generate a forecast of algae on the oceans surface. Generally, using an algorithm to identify times/locations of algae on the surface and an ML model to generate a forecast. I have a method of identifying the algae down (not ML), where my output is a binary raster of where the algae is/not on a given day/period, which would be my target data. I also have plenty of possible predictor data (also rasters) to use in a model. In the past, the only ML I've done with raster data is supervised classification with images and their labels. This means I am having trouble even finding a starting point for this part of my project. Another issue I know will be a complication is that in both my target and predictor data, there are areas of missing values, where clouds were in the way of the sensor. I am hoping I can find a model that can handle missing values to avoid imputation. I am quite new to the ML space, and have very little experience using raster data in ML. Just hoping for some ideas or places to start reading up on possible methods dealing with similar issues. Any help is appreciated. submitted by /u/DisgustedApe [link] [comments]  ( 9 min )
    [D] [P] Bitnet in Pytorch or Jax
    I was interested by the new bitnet paper https://arxiv.org/pdf/2310.11453.pdf, and was wondering if there was any way to use the 1 bit (1 or -1) in actual practice and how? More specifically, I know that you can do this with cuda (which I don't have any experience with) but it would be much better if there was a way to do this on a TPU (Jax?). Any implementation I've seen so far just pretends like they are using 1 bit but representing it with higher precisions. submitted by /u/Additional-Ad-7043 [link] [comments]  ( 9 min )
    [D] OmniAI - ETL for AI applications
    Hey r/MachineLearning, I'm one of the founders of OmniAI. We just got accepted into YC's W24 batch, and we’re super excited on simplifying AI data workflows. OmniAI is a data infrastructure layer for AI. We’re syncing a company's data into a central warehouse that's optimized for AI interactions (vectorized, graph relations, etc.). Models can be run directly on that warehouse and kept up to date with your business intelligence. We'd love to hear about any pain points you are having with vector databases. Any insight/feedback is very much appreciated 😊 submitted by /u/travelingladybug23 [link] [comments]  ( 9 min )
  • Open

    UK, US, EU and China sign declaration of AI's 'catastrophic' danger
    The UK, US, EU, and China have signed the Bletchley declaration, acknowledging the potential catastrophic risks posed by artificial intelligence (AI). The declaration does not establish an international testing hub in the UK but sets a template for future collaboration. The signatories recognize the potential for serious harm from AI models and agree on the urgency of understanding the risks. The UK Prime Minister and the UK Technology Secretary welcomed the declaration, emphasizing the need for collective action in addressing the risks of frontier AI. The declaration marks a diplomatic success for the UK, which hosted the AI safety summit. There is little international agreement on global AI regulations or who should develop them. The US announced the creation of a separate American AI Safety Institute, while the EU is in the process of passing an AI bill. The UK government plans to properly understand the problem before applying solutions and denies falling behind international counterparts. Source : https://www.theguardian.com/technology/2023/nov/01/uk-us-eu-and-china-sign-declaration-of-ais-catastrophic-danger submitted by /u/NuseAI [link] [comments]  ( 9 min )
    MASTER CHIEF vs BLACK PANTHER | AI Multi-VS
    This project uses AI such as Chat GPT-4, Eleven Labs, D-ID, & Midjourney to simulate a Virtual AI Co-Host of Cortana from the Halo Franchise. Cortana is fully voiced, modeled, & lip sank to simulate an actual artificial intelligence evaluation on Duels between characters in all Media Universes. submitted by /u/AcanthisittaCheap914 [link] [comments]  ( 9 min )
    What data/dataset would you love to have for your AI project?
    As the title say, what dataset are you looking for but find it difficult to acquire for your AI/ML project/business? Also explain what you're trying to build and how it can be useful! submitted by /u/nobilis_rex_ [link] [comments]  ( 9 min )
    Need Help About Height Map Analysis
    As mentiones in the title, I need help with height map analysis. I want to create an artificial intelligence that can analyze the height map given as an input and mark areas such as mountainous areas, river beds, plains, rocks, cliffs and other geographical details on the map. Is there any advice you can give me or could you provide suggestions which will help me to move forward? I really am insterested in this project and want to work on it. Also, if you know any examples and/or studies related to this subject, can you please share them? I am looking forward to discovering more information regarding this topic. submitted by /u/PlayerWell [link] [comments]  ( 9 min )
    Microsoft starts selling AI tool for Office, which could generate $10B/y by 2026
    Microsoft has started selling its artificial intelligence tool, Copilot, as an add-on to Office productivity software subscriptions. The tool appears in Word, Excel, and other Office programs and is priced at $30 per person per month. Piper Sandler analysts estimate that Copilot could generate over $10 billion in annualized revenue by 2026. Microsoft aims to leverage its dominant position in the productivity software market, while Google is selling its own AI enhancement for Workspace tools. Piper Sandler's model assumes that 18% of eligible users will use Copilot, driven by a fear of missing out (FOMO) element. Companies without Copilot may be at a disadvantage in competitive industries. Microsoft CEO Satya Nadella stated that customers who use Copilot find it indispensable. Microsoft has initially targeted the largest companies for Copilot adoption, with 40% of Fortune 100 companies already using it in an invitation-only paid early-access program. While there is limited data on Copilot's performance, organizations are encouraged to experiment with generative AI, which can create synthetic images and text with minimal human input. Microsoft faces the challenge of expanding Copilot adoption beyond a small core of end users to achieve wide deployment. Analysts suggest that Copilot could be distributed to highly paid executives to help prioritize email messages and understand documents, but caution that technically savvy employees familiar with generative AI may be better suited for early adoption. Microsoft may also benefit from companies using additional Azure cloud services, such as Purview for data management, during the setup of Copilot. Source : https://www.cnbc.com/2023/11/01/microsoft-365-copilot-becomes-generally-available.html submitted by /u/NuseAI [link] [comments]  ( 9 min )
    17 AI tools for Marketing
    submitted by /u/Senior_tasteey [link] [comments]  ( 8 min )
    Is there a way to implement a bunch of ChatGPT Retrieval Plugin + Nougat: Naturals Optical Understanding for Academic Documents + Vector database?
    Please tell me how I can correctly transfer all my books and textbooks and documents in PDF format to a vector database while preserving the layout structure and equations? Maybe some of the people implemented this idea using Nougat: Neural Optical Understanding for Academic Documents (https://facebookresearch.github.io/nougat /)? If so, I ask you to say a few words about how you did it. ​ And let me ask you another question: how exactly does the ChatGPT Retrieval Plugin help you in the process of solving problems? Will it be possible to use it to extract information from your vector database during the ChatGPT dialog? ​ I am grateful in advance for the answers. submitted by /u/Imunoglobulin [link] [comments]  ( 9 min )
    AI Tools Blur the Line Between Marketing Strategy and Tactics for Startups
    AI tools have become popular in marketing strategies for startups, allowing for faster execution and content creation. However, there is a concern that these tools may blur the line between strategy and tactics, leading to a lack of success stories from startups. Startups should have a solid strategy in place before relying solely on AI tools for marketing. AI tools can be valuable for solopreneurs and provide guidance and assistance in bouncing ideas off. Startups should view AI tools as a compass rather than a magic wand, guiding them along the way to their strategic goals. Using ChatGPT as an example, providing custom instructions and asking questions can lead to better marketing decisions. Source : https://www.erwanderlyn.com/p/chat-gpt-for-marketing-strategy submitted by /u/NuseAI [link] [comments]  ( 9 min )
    Silly, I had the Terminator from Terminator 2 play Zork as the Terminator. I'm still learning.
    submitted by /u/notlikelyevil [link] [comments]  ( 8 min )
    Analysis of AI Risk Discourse - 'AI Risk: An Illusion of the Future?'
    submitted by /u/LaVolpe223 [link] [comments]  ( 8 min )
    Dilemma
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 8 min )
    One-Minute Daily AI News 10/31/2023
    The contribution of generative artificial intelligence to global gross domestic product is now expected to be higher within the next 10 years as adoption of the emerging technology is expected to grow, Goldman Sachs has said. NVIDIA has unveiled a custom large language model, the technology on which artificial intelligence tools like ChatGPT are based, which the company has developed for their internal use. Trained on NVIDIA’s proprietary data, “ChipNeMo” will generate and optimize software and provide assistance to human designers in building semiconductors.[2] IBM Launches Generative AI Coding Assistant “watsonx” for Mainframe Modernization.[3] The F.D.A. has approved many new programs that use artificial intelligence, but doctors are skeptical that the tools really improve care or are backed by solid research.[4] Sources: [1] https://www.thenationalnews.com/business/technology/2023/10/31/generative-ais-economic-contribution-likely-to-rise-goldman-sachs-says/ [2] https://research.nvidia.com/publication/2023-10_chipnemo-domain-adapted-llms-chip-design [3] https://winbuzzer.com/2023/10/30/ibm-launches-generative-ai-coding-assistant-watsonx-for-mainframe-modernization-xcxwbn/ [4] https://www.nytimes.com/2023/10/30/health/doctors-ai-technology-health-care.html submitted by /u/Excellent-Target-847 [link] [comments]  ( 9 min )
  • Open

    MetNet-3: A state-of-the-art neural weather model available in Google products
    Posted by Samier Merchant, Google Research, and Nal Kalchbrenner, Google DeepMind Forecasting weather variables such as precipitation, temperature, and wind is key to numerous aspects of society, from daily planning and transportation to energy production. As we continue to see more extreme weather events such as floods, droughts, and heat waves, accurate forecasts can be essential to preparing for and mitigating their effects. The first 24 hours into the future are especially important as they are both highly predictable and actionable, which can help people make informed decisions in a timely manner and stay safe. Today we present a new weather model called MetNet-3, developed by Google Research and Google DeepMind. Building on the earlier MetNet and MetNet-2 models, MetNet-3 p…  ( 94 min )
  • Open

    Variance reduction technique proof?
    submitted by /u/Massive_Cup_4458 [link] [comments]  ( 9 min )
    Newbie here - trying to make crude image stabilization using RL
    Hi there! I recently dived into the topic of RL after finishing a neural networks course at my university. I have basic understanding of the underlying principles of CNNs, the general idea of what Reinforcement Learning is, and I'm trying to learn more by making a project. The problem that I came up with is as follows: The system receives consecutive images frame after a frame (let's assume the frames come from a prerecorded video of stationary object, but the cameraman's hand is shaking/moving slowly) and tries to compute offsets for them that allow the user to align new frames to the original one. My idea is to use RL to train a network to recognize how the current frame is offset from the original (initial, first frame fed to the network) to allow some other software or even the us…  ( 10 min )
    Invalid Action Masking when action space is continuous
    Background: I am relatively new to RL, so apologies if this post comes off as repititive or the solution is immediately obvious. But to my knowledge, I couldn't find any examples online for my use-case hence posting here. The problem I'm trying to solve is a relatively simple one. Let's say that my action space has only two variables x1 & x2, both continuous (Box). I want my valid actions to be only those where x1 + x2 < k where k is some constant. So I decide to use invalid action masking because that's the most eficient solution. But all I see online is examples of invalid action masking when actions are categorical. Even the ray discussion forums say that action masking for continuous action spaces isn't something they have done. Has anyone done this before? Any resources that you can point towards? submitted by /u/Solitary_Walker [link] [comments]
  • Open

    Dialogue-guided visual language processing with Amazon SageMaker JumpStart
    Visual language processing (VLP) is at the forefront of generative AI, driving advancements in multimodal learning that encompasses language intelligence, vision understanding, and processing. Combined with large language models (LLM) and Contrastive Language-Image Pre-Training (CLIP) trained with a large quantity of multimodality data, visual language models (VLMs) are particularly adept at tasks like image captioning, […]  ( 16 min )
    How Reveal’s Logikcull used Amazon Comprehend to detect and redact PII from legal documents at scale
    Today, personally identifiable information (PII) is everywhere. PII is in emails, slack messages, videos, PDFs, and so on. It refers to any data or information that can be used to identify a specific individual. PII is sensitive in nature and includes various types of personal data, such as name, contact information, identification numbers, financial information, […]  ( 8 min )
  • Open

    Using classical statistics to avoid regulatory burden
    On June 29 this year I said on Twitter that companies would start avoiding AI to avoid regulation. I followed that up with an article Three advantages of non-AI models. The third advantage I listed was Statistical models are not subject to legislation hastily written in response to recent improvements in AI. The chances that […] Using classical statistics to avoid regulatory burden first appeared on John D. Cook.  ( 5 min )
    Executive order on differential privacy
    This week President Biden signed a long, technically detailed executive order that among other things requires the Secretary of Commerce to look into differential privacy. Within 365 days of the date of this order … the Secretary of Commerce … shall create guidelines for agencies to evaluate the efficacy of differential-privacy-guarantee protections, including for AI. […] Executive order on differential privacy first appeared on John D. Cook.  ( 6 min )
    Differential entropy and privacy
    Differential entropy is the continuous analog of Shannon entropy. Given a random variable X with density function fX, the differential entropy of X, denoted h(X), is defined as where the integration is over the support of fX. You may see differential entropy defined using logarithm to a different base, which changes h(X) by a constant […] Differential entropy and privacy first appeared on John D. Cook.  ( 5 min )
  • Open

    Turing’s Mill: AI Supercomputer Revs UK’s Economic Engine
    The home of the first industrial revolution just made a massive investment in the next one. The U.K. government has announced it will spend £225 million ($273 million) to build one of the world’s fastest AI supercomputers. Called Isambard-AI, it’s the latest in a series of systems named after a legendary 19th century British engineer Read article >  ( 6 min )
    Turing’s Mill: AI Supercomputer Revs UK’s Economic Engine
    The home of the first industrial revolution just made a massive investment in the next one. The U.K. government has announced it will spend £225 million ($273 million) to build one of the world’s fastest AI supercomputers. Called Isambard-AI, it’s the latest in a series of systems named after a legendary 19th century British engineer Read article >  ( 6 min )
    Unlocking the Power of Language: NVIDIA’s Annamalai Chockalingam on the Rise of LLMs
    Generative AI and large language models are stirring change across industries — but according to NVIDIA Senior Product Manager of Developer Marketing Annamalai Chockalingam, “we’re still in the early innings.” In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Chockalingam about LLMs: what they are, their current state and their future Read article >  ( 5 min )
    Unlocking the Power of Language: NVIDIA’s Annamalai Chockalingam on the Rise of LLMs
    Generative AI and large language models are stirring change across industries — but according to NVIDIA Senior Product Manager of Developer Marketing Annamalai Chockalingam, “we’re still in the early innings.”  In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Chockalingam about LLMs: what they are, their current state and their future Read article >  ( 5 min )
  • Open

    Exclusive Invitation: Join My Talk on AI-Bots This Morning!
    Hey Friend,  ( 7 min )
  • Open

    TFGNN: Tensorflow GNN
    Does anyone have any experience with using the tensorflow/gnn library for training and testing neural networks on graph data. Github:Tensorflow/gnn submitted by /u/Choice-Secret-99 [link] [comments]
    Recommended object tracking (deep) models
    Hi, I already have a pretty good net that detects the object I look for in a single frame. What architectures are there for turning all these single-frame predictions into an object tracking algorithm? ​ submitted by /u/jonathan923_ [link] [comments]
  • Open

    ArXiv Pre-Print “Evaluating AI Systems under Uncertain Ground Truth: a Case Study in Dermatology”
    In supervised machine learning, we usually assume access to ground truth label for evaluation. In many applications, however, these ground truth labels are derived from expert opinions. Disagreement among these experts is typically ignored using simple majority voting or averaging. Unfortunately, this can have severe consequences by over-estimating performance or mis-guiding model selection. In our work presented in this article, we tackle this problem by introducing a statistical framework for aggregating expert opinions. The post ArXiv Pre-Print “Evaluating AI Systems under Uncertain Ground Truth: a Case Study in Dermatology” appeared first on David Stutz.  ( 4 min )

  • Open

    TMLR Paper “Conformal Prediction under Ambiguous Ground Truth”
    Conformal prediction uses a held-out, labeled set of examples to calibrate a classifier to yield confidence sets that include the true label with user-specified probability. But what happens if even experts disagree on the ground truth labels. Commonly, this is resolved by taking the majority voted label from multiple expert. However, in difficult and ambiguous tasks, the majority voted label might be misleading and a bad representation of the underlying true posterior distribution. In this paper, we introduce Monte Carlo conformal prediction which allows to perform conformal calibration directly against expert opinions or aggregate statistics thereof. The post TMLR Paper “Conformal Prediction under Ambiguous Ground Truth” appeared first on David Stutz.  ( 4 min )
  • Open

    Macbook Pro M3 for LLMs and Pytorch? [D]
    My current PC laptop is soon ready to retire, having worked for seven years. As a replacement I'm considering the new Macbook Pros. It is mainly the battery time which makes me consider Apple. These are my requirements for the laptop: great battery time 16" since I'm old and my eyes are degraded dual external monitors software engineering including running some local docker images Then I have two ML requirements which I don't know if I could fulfill using a laptop: good performance for working with local LLMs (30B and maybe larger) good performance for ML stuff like Pytorch, stable baselines and sklearn In order to fulfill the MUST items I think the following variant would meet the requirements: Apple M3 Pro chip with 12‑core CPU, 18‑core GPU, 16‑core Neural Engine 36 GB memory 512 GB SSD Price: $2899 Question: Do you think I could fulfill the ML requirements using a Macbook Pro M3? Which config would be smart to buy in such case? Thankful for advice! submitted by /u/nizego [link] [comments]  ( 9 min )
    [D] Is Nvidia's new "System Memory Fallback for Stable Diffusion" also compatible with model training / inference in general?
    Hi all, today Nvidia released a new driver version that appears to allow the GPU to use system memory instead of crashing when it runs out, seen here: https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion I was wondering if this is compatible with model training and inference in Tensorflow and/or PyTorch, and how I could enable that (or if it would just work by default). This is especially confusing to me as I run Tensorflow in WSL so I don't know if this setting would carry over. submitted by /u/joshglen [link] [comments]  ( 9 min )
    [D] Can Diffusion Models really be Steered? Not really! Not even ControlNet, ICCV23 Best Paper
    While diffusion models (e.g. Stable Diffusion) are all the rage, they don't seem to be prepared for downstream tasks. ControlNet looks great (on paper), but open-source implementations for mere mortals aren't ready for prime time. Do you have examples that show the contrary? Will FAANGs and not-really-open Research Labs be the only ones capable of making it happen? submitted by /u/btcmx [link] [comments]  ( 9 min )
    [R] - An Open-sourced Data Contamination Reports for Llama Series Models
    Data Contamination in Multi-choice QA Benchmarks How much test samples are included in Llama's training data? This presented how much test samples in popular Multi-Choise QA benchmarks are included in the training data of Llama models (Common Crawl 2017–2020). Three types of data contamination: input-only contamination, input-and-label contamination, and all contamination containing both. Input-only contamination represents contaminations where only input part of test samples was included in the training data. On the contrary, input-and-label contamination indicate both input and the answer were included in the training data. Impact on Model Performance How much data contamination affects model evaluation? The full open sourced data contamination report: https://arxiv.org/abs/2310.17589 All data and code: https://github.com/liyucheng09/Contamination_Detector submitted by /u/Simple-Leopard-7646 [link] [comments]  ( 9 min )
    [R] Small Language Models Fine-tuned to Coordinate Larger Language Models improve Complex Reasoning
    Paper link http://arxiv.org/abs/2310.18338 Description We introduce DaSLaM, which uses a decomposition generator to decompose complex problems into subproblems that require fewer reasoning steps. These subproblems are answered by a solver. We use a relatively small (13B parameters) LM as the decomposition generator, which we train using policy gradient optimization to interact with a solver LM (regarded as black-box) and guide it through subproblems, thereby rendering our method solver-agnostic. Evaluation on multiple different reasoning datasets reveal that with our method, a 175 billion parameter LM (text-davinci-003) can produce competitive or even better performance, compared to its orders-of-magnitude larger successor, GPT-4. Additionally, we show that DaSLaM is not limited by the solver's capabilities as a function of scale; e.g., solver LMs with diverse sizes give significant performance improvement with our solver-agnostic decomposition technique. submitted by /u/Gaussian_Kernel [link] [comments]  ( 9 min )
    [P] Open-source modular observability for AI Systems
    Hi r/MachineLearning community! Over the last three years, we have collaborated with hundreds of teams to enhance our understanding of observability requirements in AI systems. ML teams are trying to log as much as they can at every possible dimension. To satisfy these needs, they must be able to: - Log anything from every part of the AI Infra, - Observe and interpret the logged data at scale in a flexible fashion, - Add layers and layers of automations around the logged data. Thrilled to share with you the new product we built - AimOS, a framework to connect the dots and ensure modular observability for AI Systems! Easily log, connect, and observe any part of your AI Systems – from experimentation and production stages to input prompts and monitoring. AI Systems are not determ…  ( 10 min )
    [D] Do you calculate the accuracy and loss of a neural network or batches or the whole dataset?
    Hello, I'm curious, when evaluating a neural network for both the training and validation data, do you calculate the accuracy across the entire dataset, or at every batch and then find the average at every batch? submitted by /u/CrunchyMind [link] [comments]  ( 9 min )
    [D] Unsupervised Clustering without knowing number of classes
    Does anyone know where to find the best models for unsupervised clustering problems that don't specify the number classes? For example I googled unsupervised MNIST but IIC which holds the record requires the output dimension (k=10) to be specified? Is there a name for unsupervised clustering without knowing the number of classes? (I know of density/hierarchical clustering algorithms but am unaware of many deep learning ones) And specifically are results charted anywhere? I'm researching the topic and it seems knowing the number of things you're looking for is half the battle. I can find papers on methods that aim to find the number of clusters etc but are there any benchmarks to compare? submitted by /u/BigBrainUrinal [link] [comments]  ( 9 min )
    [R] Announcing Distil-Whisper - 6x faster than Whisper-large-v2 and performs within 1% WER on out-of-distribution
    Hey r/MachineLearning! ​ At Hugging Face, we've worked hard the last months to create a powerful, but fast distilled version of Whisper. We're excited to share our work with you now! Distil-Whisper is 6x faster than Whisper-large-v2 and performs within 1% WER on out-of-distribution datasets. On long-form audio, we even achieve better results thanks to a reduction in hallucinations. ​ For more information, please have a look: - GitHub page: https://github.com/huggingface/distil-whisper/tree/main - Paper: https://github.com/huggingface/distil-whisper/blob/main/Distil_Whisper.pdf ​ Quick summary: Distillation Process We've kept the whole encoder, but reduced the decoder to just 2 layers. Encoding takes O(1) forward passes, decoding takes O(N). To improve speed, all that matters is the decoder! The encoder is frozen during distillation while we fine-tune all of the decoder. Both KL loss and pseudo-labeling next word prediction is used. Data We use 20,000h of open-sourced audio data coming from 9 diverse audio datasets. A WER-filter is used to make sure low-quality training data is thrown out. Results We've evaluated the model only on out-of-distribution datasets and are only 1% worse than Whisper-large-v2 on short-form evals (CHiME-4, Earnings-22, FLEURS, SPGISpeech). On long-form evals (Earnings, Meanwhile, Rev 16) we beat Whisper-large-v2 thanks to a reduction in hallucinations. Robust to noise Distil-Whisper is very robust to noise (similar to its teacher). We credit this to keeping the original encoder frozen during training. Pushing for max inference time Distil-Whisper is 6x faster than Whisper on both short-form and long-form audio. In addition, we employ Flash Attention and chunked decoding which helps us achieve a real-time factor of 0.01! Checkpoints?! Checkpoints will be released this Thursday and will be directly integrated into Transformers. All checkpoints will be licensed under MIT. ​ submitted by /u/pvp239 [link] [comments]  ( 9 min )
    [D] model for "understanding" dashcam images
    i have some dashcam footage from my car and want to see how a model could embed the images in an unsupervised (or self-supervised) way so i dont have to label everything. like if scenarios that are semantically similar, but different in pixel space (pulling out the driveway in the day versus pulling in at night) could be clustered close-ish together in latent space so that i could label fewer images and have the model get the other using something like k-nearest or whatever. i am starting off with just frame level before i try to tackle videos as a sequence of images (will probably lose interest by that point, so want to get images dealt with first). i looked in to VAEs and tried training one from scratch on my data but i dont have enough compute power for that. does anyone here have any ideas about this? any pretrained off the shelf models that i could use for this? any leads for a literature survey? submitted by /u/samrus [link] [comments]  ( 9 min )
    [D] - People here who mastered out of their PhDs, do you regret it? How has your life been after that?
    Hi everyone. I'm a third-year Ph.D. student doing ML-related research and have no publications so far. I do have a couple of ongoing projects that will lead to first-author papers in the next few months. I'm also at a point where I think mastering out and getting a job might be a better option. But I also worry that I might regret not getting a Ph.D. I love research but I feel the academic environment is not a good fit for me. I just want to hear from people who were in a similar position as me. Did you stick through your PhD or did you master out? How has the life been after that? I started my PhD straight out of my undergrad so I didn't get any industry experience. So I've been thinking that getting a job with my master's and spending some time in the industry could be a good option. Then I can return back to grad school if I still have that urge. Or I can simply brave through my current situation and just get a PhD and then work in the industry. Any opinions are welcome! submitted by /u/llmlift [link] [comments]  ( 9 min )
    [P] Web-based AI content generative tools that require no inscription to be used?
    I need some AI generative tools for a state-sponsored 1 hour course a friend of mine is hosting for people aged from 12 to 60 years old. They are taking an "induction" into AI with the goal of learning how to produce content with the help of AI. The idea is that they have a range of web-based AI tools at disposal in a computer in order to do a "collaborative creative film" of sorts. The different people in the course will be grouped into pairs, and then they will choose some of AI tools in a list to help them with the creative process. The AI tools in the list should not require inscription. The no-inscription rule it's a requirement because of data protection laws (no personal e-mail or other personal info should be given). Also, it would be preferable that, if the AI requires text inpu…  ( 10 min )
    [D] ctransformers vs llama-cpp-python
    what's the difference between ctransformers and llama-cpp-python? submitted by /u/kaoutar- [link] [comments]  ( 8 min )
    [R]What can cause the silhouette score to increase like this at the end? Value of K=10 for about 200 dataset seems too much, no?
    ​ https://preview.redd.it/gkibii705kxb1.png?width=504&format=png&auto=webp&s=ae2820ab0d324224177fa971319cd82965eb0129 submitted by /u/Gandalfthebrown7 [link] [comments]  ( 8 min )
    [D] Using C++ for training a Vision Transformer Agent
    Hey! So, I am currently enrolled in a Master's Degree program and started my thesis this semester. I still have a year to develop it, so what I'm doing now is gathering as much info as I can about libraries for ML. My adviser and I have decided that we're going to use Vision Transformers as our approach to train an agent to inspect products at factories. So, a little background on me: I am a self-taught game developer. I've learned Lua, C, C++, and C# to make games, that's what I'm good at. I've studied the basics of ML at uni, but it was on Python. My most proficient language is C++, as I've worked a lot with it and feel comfortable with it, so I was thinking: Are there any good ML libraries for C++? Libraries that are as easy to use as Python libs (TensorFlow, Pytorch, etc.), for example? I love the way that you have control over the resources when coding with C and/or C++. Runtime speed is a bonus too. Thanks for the help :) submitted by /u/retroJRPG_fan [link] [comments]  ( 9 min )
    [P] Visualizing Geopolitics with UN Voting data and Embeddings
    Hi all, I'm upskilling myself in Machine Learning. As a learning project, I trained an embedding model to group countries in the UN based on their voting habits. One I thought the results are interesting so I wanted to share. Medium post here: https://medium.com/@sambhattacharyya/visualizing-geopolitics-with-un-data-and-machine-learning-b93f84270900 View the the interactive 3d graph of embeddings here: https://sambhattacharyya.com/visualizing-geopolitics/index.html, 2) I'm still learning, so if anyone has feedback on the technical approach (I'm especially sure that using MSE loss isn't ideal here). Colab notebook here: https://colab.research.google.com/drive/1YkM_AsHCcs_boOCqobyQAs38Q43cm1Th#scrollTo=6801c41a-03d2-447f-8929-15d4f399df0f Raw dataset I compiled here: https://huggingface.co/datasets/sam-bha/un-general-assembly-votes-2000-2023 submitted by /u/sam_bha [link] [comments]  ( 9 min )
    [D] has anyone experimented with neural search for e-commerce applications?
    Hi. I’m an AI engineer of an emerging retailer. We’re continuously pushing the boundaries of our user search experience. We’ve got a massive inventory, hence a lot of data to be managed. This got me thinking about the untapped potential of neural search. I've had my hands on OpenAI's GPT and Deepset's Haystack lately. Both tools are great in specific scenarios, but integrating them seamlessly at an enterprise scale is challenging, especially when we're talking about real-time user interactions. The primary challenge remains in managing multimodal data efficiently without sacrificing speed. To add context, my goal in leveraging something like GPT for e-commerce is to create a more intuitive, conversational, and responsive search function. Imagine a user typing in a vague description or query, and the system providing product suggestions like a seasoned salesperson would. Given the vast product range, the neural search could bridge the gap between user intent and the most relevant product offerings. If anyone has experience with this I’d like to hear your thoughts, and if you have any other tool recommendations for this pls do share. I’d be grateful for any help submitted by /u/PositiveFixing12 [link] [comments]  ( 9 min )
    [D] Is this close enough to be usable? Need your inputs: Automated RAG testing tool. AI Data Pipelines for Real-World Production (Part 3)
    Hey there, Redditors! I'm back with the latest installment on creating dependable AI data pipelines for real-world production. If you've been following along, you know I'm on a mission to move beyond the "thin OpenAI wrapper" trend and tackle the challenges of building robust data pipelines. With 18 months of hands-on experience and many user interviews, I realized that with the probabilistic nature of systems, we need better_testing.gpt: 1. As you build you should test The world of AI is a fast-moving one, and we've realized that just working on systems is not an optimal design choice. By the time your product ships, it might already be using outdated technology. So, what's the lesson here? Embrace change, test along, but be prepared to switch pace. 2. No Best Practices Yet for R…  ( 10 min )
    [P] A site where you can ask the same question to GPT-2, GPT-3, GPT-3.5 and GPT-4, and compare the outputs
    Hi /r/machinelearning! I've been working with my collaborators on a site where you can compare OpenAI models to get a sense of the improvement over time of the models: https://theaidigest.org/progress-and-dangers https://preview.redd.it/khruhgkp7jxb1.png?width=1960&format=png&auto=webp&s=21d13125145f7fae7351686d4078868d65cbf8c3 It includes a number of things that you might be interested in: You can ask any question and compare the outputs from the OpenAI models: https://preview.redd.it/s5e9acev8jxb1.png?width=1458&format=png&auto=webp&s=0c3e5ba3661fccfc4f4ba60db346b6142b1e52f3 Visualises OpenAI models benchmark performance across 22 benchmarks: https://preview.redd.it/vhai63308jxb1.png?width=1948&format=png&auto=webp&s=07f65f131b2e6d5122400120a11d24205b7d08d6 Shows examples of benchmark outputs for GPT-2 to GPT-4 https://preview.redd.it/f3p7ni068jxb1.png?width=1980&format=png&auto=webp&s=dfe25c8c4a486a0df3c4cce2e4497fd250163bd1 Discusses some dangerous emerging capabilities, such as biological weapons: https://preview.redd.it/n6hinz7b8jxb1.png?width=2002&format=png&auto=webp&s=70cf0a0c228e1ac194146040c23a7f41dfe4e09a Includes an example of a simple agent autonomously exploiting a vulnerability in a game's code: https://preview.redd.it/5a584w7f8jxb1.png?width=1944&format=png&auto=webp&s=3867a865c06b6e36fc2424f6ced038248ee0cafd I hope you'll find this a valuable resource for getting familiar with older LMs, comparing the outputs, and thinking about what's next in this space. Here's a link to the site: https://theaidigest.org/progress-and-dangers submitted by /u/timegentlemenplease_ [link] [comments]  ( 9 min )
    [D] Problems with WGAN for time series imputation
    I am trying to implement a WGAN for time series imputation. However, I am facing many problems due to the fact that I am a novice in generational models and I am not used to many concepts related to WGAN. ​ I know that WGAN have a problem with critical weights that make them diverge infinitely, and there are three main ways to solve this problem as discussed in this other post. However, implementing them with the use of recurrent neural networks have been a total headache. ​ 1- With a weight clipping with a threshold of 0.1 (as suggested in the literature) my critical loss does not change at all. I think this may be due to a gradient vanishing problem in recurrent networks. ​ 2- Using Spectral Normalization is not well implemented in Torch for recurrent layers as it only handles one layer weight. ​ 3- I cannot use Gradient Penalty because its implementation, as far as I know, depends on an interpolation of the critical value of an interpolated data between real and generated samples, which is not available to me as all samples have a mixture of real and missing values. ​ Is there a solution to this problem that I am missing? submitted by /u/SrPinko [link] [comments]  ( 9 min )
    [D] How to train a model that'll generate an "average" image based on a large set of images?
    There's a lot of image generators that'll allow you to generate multiple images based on the input data. Is there something that'll generate an image that's an average of the trained model instead? Didn't plan on using machine learning for this project but realized it might be an interesting path to explore. submitted by /u/max_b_jo [link] [comments]  ( 9 min )
    [D] Fast and reliable keyword extraction
    Please forgive my ignorance as I am a software engineer and fairly new to datascience and machine learning. Also, feel free to delete my post if this is not the right place to ask. I am currently working on a bookmark manager app that offers content preservation and automatic keywords extraction among other features to extend these bookmarks and make them more discoverable. For the life of me, I can't find a reliable way to extract keywords. I so far tried to use a python library called Newspaper3k But results were a mixed bag, half of the time, it will knock it out of the park with very accurate results, the other half, it will just output garbage. I have switched to using openai gpt3.5 APls but I really hate it. It's slow, very verbose and give me feeling of disgust because it's like using a machine gun to get rid of a fly. I have looked in huggingface and tried a couple of models but no luck so far. Please help. I am happy to selfhost something or just pay for a good APls submitted by /u/goodkernel [link] [comments]  ( 9 min )
  • Open

    How to use neural network to learn a Q table?
    I have studied the Q learning algorithm and applied it to the classic gridworld problem. I was able to use the update formula to generate the correct Q table. Now I have been assigned to generate the Q table using a neural network, rather than the update formula. However, I do not understand how a neural network could be used to learn a Q table. I would say that the input should be the state of the agent, and the output should be an action. But how do I know how many layers I should make? And how many nodes in each layer? And how do I optimize the weight? Any guidance would be immensely appreciated. submitted by /u/Parking_Antelope8865 [link] [comments]
    CodeFusion: A Pre-trained Diffusion Model for Code Generation
    submitted by /u/nickb [link] [comments]
    RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models — Together AI
    submitted by /u/nickb [link] [comments]
  • Open

    Two ChatGPTs Break the Silence: An Unmissable Verbal Showdown on AI Ethics!
    Hey guys I made a video earlier last night using two ChatGPT accounts with custom instructions running GPT4 on voice and had them have a debate over the ethics of AI, I thought it was pretty interesting and fun to do. I wonder what other fun things I can experiment and make the two of them do lol. https://www.youtube.com/watch?v=fFoyCiAwmfY submitted by /u/adamariefox [link] [comments]
    ChatGPT, let us create movies about ChatGPT.
    submitted by /u/Philipp [link] [comments]
    Artificial intelligence in sport — the key to success?
    submitted by /u/donutloop [link] [comments]
    Nvidia tests chatbots in chip design process in bid to use more AI
    Nvidia is testing chatbots in the chip design process to incorporate more AI. The company has used a large language model augmented with 30 years of chip design data to create chatbots that can answer questions from junior designers, saving senior designers time. The research also found that adding specific data from the company's experience can make a relatively modest chatbot more accurate than an advanced one. Nvidia demonstrated the use of AI to generate code, aiming to enhance engineers' productivity rather than replace them. Source : https://www.reuters.com/technology/nvidia-tests-chatbots-chip-design-process-bid-use-more-ai-2023-10-30/ submitted by /u/NuseAI [link] [comments]
    FULL LIST: Who is attending Britain's AI Safety Summit tomorrow?
    submitted by /u/TBP-LETFs [link] [comments]
    Happy Halloween! Choose your house
    submitted by /u/Sea_Permit5660 [link] [comments]
    Elon Musk to attend Rishi Sunak’s AI safety summit in Bletchley Park
    submitted by /u/nick9000 [link] [comments]
    Exclusive: G7 to agree AI code of conduct for companies
    submitted by /u/donutloop [link] [comments]
    Biden administration aims to cut AI risks with executive order
    submitted by /u/donutloop [link] [comments]
    One-Minute Daily AI News 10/30/2023
    India leads the way in global AI skill penetration, finds Stanford University’s AI Index Report.[1] Google Bard, the conversational AI tool by Google, can now respond to your questions in real-time. You can turn it off and tell Bard to only respond once the answer is complete, but now, by default, Bard writes out the response in real time.[2] Chinese technology giant Alibaba said on Tuesday it has updated its artificial intelligence (AI) model Tongyi Qianwen and released a suite of industry-specific AI models amid an intensifying AI race among tech companies.[3] Biden signs sweeping executive order regulating artificial intelligence.[4] Sources: [1] https://www.firstpost.com/tech/news-analysis/india-leads-in-ai-skills-and-github-ai-projects-says-stanfords-ai-index-report-13318412.html [2] https://searchengineland.com/google-bard-can-now-respond-in-real-time-433954 [3] https://finance.yahoo.com/news/1-alibaba-upgrades-ai-model-035625254.html [4] https://www.thedailynewsonline.com/news/biden-signs-sweeping-executive-order-regulating-artificial-intelligence/article_d461cda8-7737-11ee-8036-93d6b4aa3413.html submitted by /u/Excellent-Target-847 [link] [comments]
  • Open

    DSC Weekly 31 October 2023
    Announcements Top Stories In-Depth The post DSC Weekly 31 October 2023 appeared first on Data Science Central.  ( 20 min )
    Data engineering career guide
    From ensuring that mobile apps function smoothly to facilitating personalized recommendations and targeted ads, data engineering powers the digital experiences that have become part of many of our day-to-day lives. There is currently a major need for knowledgeable and skilled professionals to fill open data engineer roles. Do you have the skills and experience to… Read More »Data engineering career guide The post Data engineering career guide appeared first on Data Science Central.  ( 21 min )
    Approaches to creating virtual fitting room software using AR and AI
    Virtual fitting room software with AR and AI is the next best alternative to physical stores. With many different kinds of virtual fitting room solutions on offer though, it can be hard to know which ones are the most feasible for your business. Let’s talk about the various approaches to developing such solutions.  Types of… Read More »Approaches to creating virtual fitting room software using AR and AI The post Approaches to creating virtual fitting room software using AR and AI appeared first on Data Science Central.  ( 21 min )
    How hybrid AI can help LLMs become more trustworthy
    Back in 2015, Pedro Domingos of the University of Washington’s computer science department published The Master Algorithm. In the book, Domingos explored the possibility that one master algorithm could indeed rule them all. The main challenge, he said, was to bring the AI tribes together so that the strengths of the various approaches could be… Read More »How hybrid AI can help LLMs become more trustworthy The post How hybrid AI can help LLMs become more trustworthy appeared first on Data Science Central.  ( 21 min )
    AGI Jesse and the future of finance
    Introduction The financial world is on the cusp of a remarkable transformation, thanks to the integration of advanced AI models like GPT-4. In this article, we delve into the evolving landscape of Machine Learning (ML) in finance and explore the potential impact of these cutting-edge AI systems. The need for speed: Reacting to regime shifts… Read More »AGI Jesse and the future of finance The post AGI Jesse and the future of finance appeared first on Data Science Central.  ( 18 min )
    How AI chatbots are transforming the world?
    AI chatbot technology has taken the world by storm. From assisting individuals with their content requirements to facilitating top-notch customer service for businesses, AI chatbot technology offers it all.  The technology has penetrated multiple prominent industries. And rightly so. AI chatbots provide customer insights that help businesses strategize their marketing and sales plans, ultimately driving… Read More »How AI chatbots are transforming the world? The post How AI chatbots are transforming the world? appeared first on Data Science Central.  ( 22 min )
    How technical program managers can build a robust Generative AI future
    The modern digital ecosystem, buzzing with the chatter of data and algorithms, presents both promises and challenges. In this intricate web, generative artificial intelligence (GenAI) shines as a beacon of innovation. To harness this power, enterprises need more than just cutting-edge technology. They need a bridge between ambition and realization—a role aptly filled by… Read More »How technical program managers can build a robust Generative AI future The post How technical program managers can build a robust Generative AI future appeared first on Data Science Central.  ( 21 min )
    Generative AI ethics: Navigating the boundary between human and machine creativity
    Generative AI is revolutionizing our creative landscape, unlocking unprecedented possibilities. But at what cost? Dive into the ethical dilemmas of this transformative technology, exploring the fine line between innovation and ethical consideration.  2022 was a huge year for Generative AI. The release of DALL-E 2 in April showed the public the possibilities of text-to-image Gen… Read More »Generative AI ethics: Navigating the boundary between human and machine creativity The post Generative AI ethics: Navigating the boundary between human and machine creativity appeared first on Data Science Central.  ( 23 min )
  • Open

    Riding the Rays: Sunswift Racing Shines in World Solar Challenge Race
    In the world’s largest solar race car event of the year, the University of New South Wales Sunswift Racing team is having its day in the sun. The World Solar Challenge, which first began some 35 years ago, attracts academic participants from across the globe. This year’s event drew nearly 100 competitors. The race runs Read article >  ( 6 min )
    Riding the Rays: Sunswift Racing Shines in World Solar Challenge Race
    In the world’s largest solar race car event of the year, the University of New South Wales Sunswift Racing team is having its day in the sun. The World Solar Challenge, which first began some 35 years ago, attracts academic participants from across the globe. This year’s event drew nearly 100 competitors. The race runs Read article >  ( 6 min )
    DLSS 3.5 With Ray Reconstruction Now Available in NVIDIA Omniverse
    The highly anticipated NVIDIA DLSS 3.5 update, including Ray Reconstruction for NVIDIA Omniverse — a platform for connecting and building custom 3D tools and apps — is now available.  ( 7 min )
    DLSS 3.5 With Ray Reconstruction Now Available in NVIDIA Omniverse
    The highly anticipated NVIDIA DLSS 3.5 update, including Ray Reconstruction for NVIDIA Omniverse — a platform for connecting and building custom 3D tools and apps — is now available.  ( 7 min )
  • Open

    Schneider Electric leverages Retrieval Augmented LLMs on SageMaker to ensure real-time updates in their ERP systems
    This post was co-written with Anthony Medeiros, Manager of Solutions Engineering and Architecture for North America Artificial Intelligence, and Blake Santschi, Business Intelligence Manager, from Schneider Electric. Additional Schneider Electric experts include Jesse Miller, Somik Chowdhury, Shaswat Babhulgaonkar, David Watkins, Mark Carlson and Barbara Sleczkowski.  Enterprise Resource Planning (ERP) systems are used by companies to […]  ( 10 min )
  • Open

    Control of spider like robots using GNN & RL
    hii this is my project but I have bare minimum coding knowledge and know even lesser about GNN and RL, has anyone worked on something like this before and would be able to dumb it down for me?? submitted by /u/sunshinebreakfast7 [link] [comments]
    RL & LLMS: An in-depth look at modern Game Playing AI Systems
    submitted by /u/AvvYaa [link] [comments]
    Why Gym/Gymnasium removed done from the step function
    submitted by /u/jkterry1 [link] [comments]
    "Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier", D'Oro et al 2023
    submitted by /u/gwern [link] [comments]
  • Open

    The spookiest Halloween scenes
    Google Bard has the ability to describe images. But it turns out what you get depends a lot on how you ask. I gave Bard this image and the prompt "Please describe this spooky Halloween scene". On the right is the image I got when I took the  ( 6 min )
    Bonus: more spooky Halloween scenes
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Positive polynomials revisited
    The square of a real-valued polynomial is clearly non-negative, and so the sum of the squares of polynomials is non-negative. What about the converse? Is a non-negative polynomial the sum of the squares of polynomials? For polynomials in one variable, yes. For polynomials in several variables, no. However, Emil Artin proved nearly a century ago […] Positive polynomials revisited first appeared on John D. Cook.  ( 5 min )

  • Open

    Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence [N]
    https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/ It looks like content will have to be labeled, showing if it's AI-generated or not. And special rules will apply to: any model that was trained using a quantity of computing power greater than 1026 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 1023 integer or floating-point operations; and any computing cluster that has a set of machines physically co-located in a single datacenter, transitively connected by data center networking of over 100 Gbit/s, and having a theoretical maximum computing capacity of 1020 integer or floating-point operations per second for training AI. Also, easier visas for "AI talent". submitted by /u/we_are_mammals [link] [comments]  ( 9 min )
    [D] Relevance Extraction in RAG Pipelines
    I came across this interesting problem in RAG, what I call Relevance Extraction. After retrieving relevant documents (or chunks), these chunks are often large and may contain several portions irrelevant to the query at hand. Stuffing the entire chunk into an LLM prompt impacts token-cost as well as response accuracy (distracting the LLM with irrelevant text), and and can also cause bumping into context-length limits. So a critical step in most pipelines is Relevance Extraction: use the LLM to extract verbatim only the portions relevant to the query. This is known by other names, e.g. LangChain calls it Contextual Compression, and the RECOMP paper calls it Extractive Compression https://twitter.com/manelferreira_/status/1713214439715938528 Thinking about how best to do this, I realized i…  ( 10 min )
    [D] Face Recognition
    I am working for a months on facial recognition system. I have 500 classes with 10 template for each class. I have applied almost and tested almost every model. Some of them are dlib 128 dim. Facenet (512), Insightface, Arcface. But none of them reduced the rate of false positive. I also fine tuned those models, I also trained a clasifier and I reached to 91 percent of accuracy with a very low quality images and also have a backlight effect. But the what I want my setup to be 99 percent which correctly classifies the person with a high similarity on true positive and a low similarity around 0.1 or 0.2 for an unknown class. Now my problem is that similarity is also high with unknown almost around 99 and I also used some distance metrics and it also not helping out. Note: the environment is a real world environment and I used CCTVs at a place where hundreds of unknown people visit daily. submitted by /u/No_Garbage9512 [link] [comments]  ( 9 min )
    [D] Computation graphs with in-place operations
    With a computation graph typically used to represent NNs, where nodes represent operations and edges represent data dependencies, is it possible to represent in-place operations? submitted by /u/thanrl [link] [comments]  ( 9 min )
    [D] How to evaluate if LLMs are following certain guidelines or not?
    I recently wrote a blog on evaluating whether your LLM applications are following required guidelines: https://uptrain.ai/blog/lost-in-translation-the-critical-impact-of-neglecting-guideline-adherence-in-llms?utm_source=reddit&utm_medium=reddit&utm_campaign=reddit Please let me know your feedback in the comment section submitted by /u/Vegetable-Skill-9700 [link] [comments]  ( 9 min )
    [R] RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models
    Blog: https://together.ai/blog/redpajama-data-v2 Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2 GitHub: https://github.com/togethercomputer/RedPajama-Data Description: RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. Out of these, there are 30B documents in the corpus that additionally come with quality signals, and 20B documents that are deduplicated. submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [R] FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions
    Paper: https://arxiv.org/abs/2310.15421 Code: https://github.com/skywalker023/fantom Blog: https://hyunw.kim/fantom/ Abstract: Theory of mind (ToM) evaluations currently focus on testing models using passive narratives that inherently lack interactivity. We introduce FANToM 👻, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering. Our benchmark draws upon important theoretical requisites from psychology and necessary empirical considerations when evaluating large language models (LLMs). In particular, we formulate multiple types of questions that demand the same underlying reasoning to identify illusory or false sense of ToM capabilities in LLMs. We show that FANToM is challenging for state-of-the-art LLMs, which perform significantly worse than humans even with chain-of-thought reasoning or fine-tuning. ​ https://preview.redd.it/mxb85o2vkexb1.png?width=1367&format=png&auto=webp&s=8749cddd15e6740e69ae47ef5edf3a1da96d89c2 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Who's A/B testing in prod?
    I never hear much about A/B testing in the ML community. Do you all tend to A/B test your changes? I have to imagine this is becoming more prevalent with LLMs and prompts. In my experience, every model change we made was through an A/B test. There is some literature in our domain that offline metrics may not correlate to online metrics, so we looked at both very closely. My team was heavily involved in A/B testing as well as managing the data pipeline from those exposures to create training data for the models. What do your experience look like? submitted by /u/mtbarta [link] [comments]  ( 9 min )
    [P] Looking for partner for a doing a research paper in ML/NLP politeness identification
    [P] Dm for further info but all in all its abt adding an extra feature (feature engg.) To our ML model to detect politeness in a language submitted by /u/PrudentFly8507 [link] [comments]  ( 9 min )
    [P] Deconstructing interface design for generative AI
    Recently, niji journey launched a mobile app, and there, they talked about how hard it is for people who are new to the concept of generative AI art to get started, and some of the workarounds they came up with. https://sizigi.notion.site/Kindling-the-Spark-of-Inspiration-d8edd06cede04401b4efba8324005cf3?pvs=4 Thought it would be interesting to share, especially with how they try to push forward the ideas from other users as a font of knowledge kind of thing. submitted by /u/kalmatos [link] [comments]
    [D] WER improved before Fine-Tuning Whisper, but increased afterward: Why?
    Hey everyone, I've been working on fine-tuning Whisper using generated transcribed audio data. Before I began the fine-tuning, I evaluated the base model's accuracy on a test set, which showed: Test_accuracy: WER = 23.078% I then fine-tuned the model for 3,000 steps using a dataset of 1,000 samples: 700 for training and 300 for testing. This amounts to roughly 4 hours of data, with each audio clip being around 30 seconds in duration. However, post-fine-tuning, the WER started off at 30%, which is higher than the base model's 23.078%. Intuitively, I'd expect the WER to start lower after fine-tuning. Does anyone have insights or suggestions on what might be causing this discrepancy? Any advice would be greatly appreciated! https://preview.redd.it/b619v95rgdxb1.png?width=906&format=png&auto=webp&s=fa17f311301919ccfcd7a8cea47f47478e1cd91d Edit: Solved the issue. submitted by /u/stoicbats_ [link] [comments]  ( 9 min )
    [R] Do any solutions exist for this ranking optimization problem
    Let’s say I already have a ranking method (a learning to rank approach) to sort search results (items) based on their utility to user. However, I also want to make sure certain groups of items get enough exposure across all the searches. I have defined target exposure for each group but I’m struggling to find a method that would balance utility and this exposure. Particularly because utility can be defined for a given search and item, but the group target exposure is only defined for an aggregated set of search results. Any ideas for a multi objective optimization method or even a post hoc re-ranking approach? submitted by /u/MLE-MAP [link] [comments]  ( 9 min )
    From Data Architect to Machine Learning Enthusiast [D]
    Please read, clap and follow: https://medium.com/@andysingal/from-data-architect-to-machine-learning-enthusiast-e132f6cd35fc?sk=d7d2a04a08ca7f328c4ae7ee3769ec50 submitted by /u/Fit_Maintenance_2455 [link] [comments]
    [D] Pixel Perfect Segmentation Datasets?
    Hi! Looking for any insights/direction on high-quality (near pixel-perfect) segmentation datasets for benchmarking various segmentation models in a way that others can reproduce - not sure if these exist or if anyone has specific examples they have worked with? Working with the usual suspects for segmentation like ADE20K and Cityscapes I have seen results using the pre-trained models on these datasets are a bit disappointing wrt high quality (clean edges, label consistency) segmentation - even models considered SOTA in 2023 like OneFormer/SegFormer/etc. Though my hypothesis is that this is largely due to the label approximations (many are drawn as rough polygons) and labeling inconsistencies building these large datasets with many classes. My observations and personal experience is many private companies have custom, high-quality segmentation data they have produced internally at high financial/time cost and thus are reluctant to open source and share, but hoping there are some hidden gems out there I don't yet know about... submitted by /u/FocalAIDev [link] [comments]  ( 9 min )
    [P] Predicting velocity vectors with a CNN
    Hi all I am, working on a project to predict velocity vectors from boundary conditions... I have carried out a simulation of an office on Ansys I have x,y,z velocity vectors I have repeated the simulation for 100s of ac inlet and outlet velocities as input data for each set of 3D-Velocity vectors.Example of the dataset for each inlet, outlet permutation, the dataset for each permutation is around 1 million coordinates + xyz velocity vectors long. *** I now want to predict the x,y,z velocity vectors at each location based on the AC inlet and outlet velocity *** How do I organise the data best for a CNN/other ANN? Any other tips of the CNN architecture? submitted by /u/No_Range3026 [link] [comments]  ( 9 min )
    [N] Fast GPT Training Infra, FP8-LM, being 64% faster than BF16 on H100—Unlocking even more gigantic GPT
    I just discovered the FP8-LM paper from MS: [2310.18313] FP8-LM: Training FP8 Large Language Models (arxiv.org). This is their repo link: Azure/MS-AMP: Microsoft Automatic Mixed Precision Library (github.com) paper abstraction My Key Takeaways: The whole-loop for FP8 “GPT-style” large model training is successfully done by FP8-LM team, including data cleaning, infrastructure development, model pretraining, alignment (SFT, RS, RLHF, etc.) Their FP8 mixed-precision training framework got 42% reduction in memory usage, and ran 64% faster than BF16 Megatron-LM; also faster than Nvidia Transformer Engine by 17% ​ https://preview.redd.it/jeaadb1jncxb1.png?width=793&format=png&auto=webp&s=2175969217ff0ff3c8149d17b8011408f4f84c91 It is thrilling to think about that we can scale up the already gigantic model size by 2.5x without needs for more GPU memory…and this can be achieved with NO performance degradation on a wide range of benchmarks as demonstrated in the paper. ​ https://preview.redd.it/vlu6o5cnncxb1.png?width=1389&format=png&auto=webp&s=ed97ea1431f8d9a2900490812f23131681c788f8 ​ https://preview.redd.it/murtte9oncxb1.png?width=1289&format=png&auto=webp&s=6ebd242d69380f2bd95dcd2fa2afe18d7c4b3667 submitted by /u/TensorTamer [link] [comments]  ( 9 min )
    [P] Active Learning with Domain Experts – Sort of a case study
    Hey, r/MachineLearning, it’s Dean from DagsHub 🐶 We recently had the opportunity to work with a domain expert in creating a machine learning model and we thought we should share what we learned in the process. A dentist reached out to us to help him create a machine learning model, which could segment teeth in panoramic X-rays. He had some data pre-labeled, but the vast majority of his dataset was unlabeled. Since labeling these X-rays is a time consuming process and requires domain knowledge, we decided to use Active Learning. Following our success in creating an Active Learning pipeline in a Jupyter Notebook using Data Engine, we created a new Tooth Fairy project, which expands on that and brings even more capabilities into the notebook. https://dagshub.com/blog/active-learning-with-domain-experts-a-case-study/ Check out our post and learn: Why and when you should use Active Learning How to efficiently work with domain experts (and mistakes to avoid!) What a real use-case Active Learning pipeline looks like, by checking out the accompanying repo We value your thoughts and feedback! Looking forward to hearing from you all! submitted by /u/PhYsIcS-GUY227 [link] [comments]  ( 9 min )
    [D] Semantic search on different medical codes
    Hi everyone, Need some ideas to bounce off. I have several medical codes, let’s name them A, B, C and D. Each medical code consists of multiple clauses, say, 1.1, 1.2 and so on. I want to create a model (?) where a text input of a textual clause will show up all other related clauses from different medical codes. For example, if I input clause 3.2 from medical A, I want the output to show up the related/similar clauses from code B, C and D. I have thought of using something like a Retrieval Augmented Generation for this, but anyone has any better ideas regarding this topic? Could a language model do something about this? Thanks! submitted by /u/plsendfast [link] [comments]  ( 9 min )
    Predicting clusters with regression [D]
    Hello. I have historical data of real estate. Includes attributes like price, revenue, maintenance, maintenance debt etc. I had two ideas for the data previously, to predict future real estate price and to use some clustering algorithm to put the properties into categories (good, average bad or something like that). Now I got this idea to generate clusters for any given point in time and use that history of clusters to predict migration of properties between clusters. I am aware of inter cluster migration estimation but that seems to be a prediction of how points shift inside a cluster with the introduction of new data points rather than a timeline of the movement of points in the cluster. Writing this I'm also thinking it might be possible to simply treat the history of clusters as a category to be predicted e.g. given the history of property X that is in category A the prediction is that in 6 months it's heading for cluster B. Does anyone have experience with similar work? Are there papers I am missing? Thanks in advance. submitted by /u/arachnarus96 [link] [comments]  ( 9 min )
    [D] Classification problem giving me white hair
    So! I am a ML newbie and was wondering if one of you pros can help me out on my learning journey (tool use = google colab). I have a csv file containing loan data where each row is a customer that applied for a loan. One of the columns is called TARGET and it shows whether the customer's loan request was approved or not. All sorts of data points are captured e.g. age, gender, salary, employment details like industry, assets, etc. I've done cross validation and found that GradientBasedClassifier and LGBM perform the best. Cross validation also tells me that their accuracy is between 68%-70%. My problem is that I SUCK at hyper param optimisation. How do you go from 68 to +80%??? Or 90%? For the curious ones, here is the dataset: https://drive.google.com/file/d/1IKNVstck6gnXvfGS-mVRMAE1RFrDNUgZ/view?usp=sharing submitted by /u/Critical_Ad_1205 [link] [comments]  ( 9 min )
    How to extract features from XML of svg images dataset to create features? [D]
    I want to create several labels for each of these images, such as no. of bedrooms, bathrooms, etc. I have a large dataset of 2000 images, each which have XML which contains text that labels each room with the ID.How do I automate the process, and create a dataset that includes all the features for the labels, so that it's ready for machine learning? https://svgshare.com/i/z53.svg submitted by /u/pranksbanker [link] [comments]  ( 9 min )
    [R] The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
    submitted by /u/Jean-Porte [link] [comments]  ( 8 min )
    [D] How do you deal with LLM observability? What tools do you guys use?
    I want to know the tools and methods you use for the observability and monitoring of your ML (LLM) performance and responses in production. submitted by /u/Ok_Cartographer5609 [link] [comments]  ( 9 min )
    [R] SuperResolution: DLSS, VSR
    Hey everyone! I recently delved deep into the fascinating world of Super Resolution and put together a tutorial covering both SRGAN and ESRGAN. Plus, for my fellow gamers out there, I've included a comparison between DLSS and VSR in gaming scenarios. For those interested in the technical details, there's also a hands-on PyTorch walk-through of ESRGAN. I put a lot of effort into making the content approachable and informative, so whether you're a newbie or a seasoned pro, there should be something for everyone! Check out the video here: https://youtu.be/Z0jl8YM5kzU Would love to get your feedback, thoughts, or any insights you might want to share! Cheers! 🍻 Disclaimer: This video provides a broad overview of SuperResolution. NVIDIA's DLSS technology involves real-time rendering complexities that may not be fully detailed. For a deeper technical dive, we value and encourage viewers own research. We can have a separate video on DLSS as well. submitted by /u/Worldly-Inflation-92 [link] [comments]  ( 9 min )
  • Open

    Identifiers depend on context
    Can you tell who someone is from their telephone number? That’s kinda the point of telephone numbers, to let you contact someone. And indeed telephone number is one the 18 identifiers under HIPAA Safe Harbor. But whether any piece of information allows you to identify someone depends on context. If you don’t have access to […] Identifiers depend on context first appeared on John D. Cook.  ( 5 min )
  • Open

    Samsung apps
    Hello everyone, I've recently started tinkering with ai art generators and could use some advice. I'm on android and currently using magir app, I'm wondering if there are any free ai art generators with no restrictions/ have nsfw etc that doesn't come with a pay wall, magir has lifetime pro for $40, I'm still learning but certainly see the potential so I'm asking for group knowledge please and thank you in advance. submitted by /u/gundamt51 [link] [comments]
    Europe's newest AI hub is being built in a German city no one's heard of
    submitted by /u/donutloop [link] [comments]
    Anyone tried boosting GPT with other AI tools? What were your results?
    TL;DR - Title. I’m a grad student and had to do some work on a study on the socio-economic impacts of AI and developing an interactive educational platform to ease learning for visually impaired students. I’ve had to do ‘delegate’ some work to chatgpt, and the results have been kinda unimpressive. I shared this with a colleague and he said Ai results are like that, and if I wanted better results I could mix and match, or use other ai tools to ‘boost’ (?) gpt. Is this a viable strategy, or do I have to make do with whatever I have? submitted by /u/CrispOriginality [link] [comments]
    Humanity at risk from AI 'race to the bottom', says tech expert
    A tech expert warns that unrestrained AI development by a few tech companies is endangering humanity's future. The expert calls for AI safety standards and regulation to prevent the reckless development of powerful AI systems. In a policy document, AI experts argue that governments should have the authority to halt the development of exceptionally powerful AI models. Concerns about the development of artificial general intelligence, which can perform tasks at or above human levels, are also raised. The article mentions the investments made by Amazon, Microsoft, Alphabet, and Facebook's Meta in AI and cloud computing. Source : https://www.theguardian.com/technology/2023/oct/26/ai-artificial-intelligence-investment-boom submitted by /u/NuseAI [link] [comments]
    Why does Google Tensor exist if Snapdragon is better at AI? Short
    The article discusses the existence of Google Tensor in light of the impressive AI capabilities of the Snapdragon 8 Gen 3 chip. While Tensor has been praised for bringing AI breakthroughs to Pixel phones, some argue that many of its AI features actually rely on an internet connection and offload tasks to the cloud. In comparison, the Snapdragon chip can perform on-device AI tasks, such as generating images, quickly and without the need for an internet connection. Despite the criticisms, one argument for Tensor is its longer support timeline and the ability for Google to focus on specific AI applications. However, after seeing Qualcomm's AI demos, the article questions the validity of Tensor's main pitch for AI. Source : https://9to5google.com/2023/10/29/snapdragon-8-gen-3-google-tensor-ai/ submitted by /u/NuseAI [link] [comments]
    Image edition tool to merge images
    Hey! I have a specific background that I would like to use together with various product pictures that have varying backgrounds. Essentially I am looking for a tool that can remove backgrounds of my product images, then add a certain background I have and do the shadows and lightning well on the background. Any ideas? For example in adobe firefly I can remove the background and generate a new one with a prompt but it’s not exactly like the background I would like to use and I need it to be 95% similar at least. Thanks in advance! submitted by /u/Herrpadda [link] [comments]
    New Chinese style
    submitted by /u/Sea_Permit5660 [link] [comments]
    One-Minute Daily AI News 10/29/2023
    The Ai Pin, the new gadget / wearable device / projector / thing from the secretive startup Humane, might cost as much as $1,000 and may require a monthly subscription for data, according to The Information.[1] OpenAI to release updated version of ChatGPT that gives users access all GPT-4 tools – including browsing and DALL·E 3 – without switching.[2] MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations.[3] Vietnam is at the ‘leading edge’ of AI developments in emerging Southeast Asia: JPMorgan.[4] Sources: [1] https://www.theverge.com/2023/10/27/23935644/humane-ai-pin-price-subscription [2] https://www.searchenginejournal.com/new-version-of-chatgpt-gives-access-to-all-gpt-4-tools-at-once/499607/ [3] https://mimicgen.github.io/ [4] https://www.cnbc.com/video/2023/10/30/vietnam-ahead-in-ai-developments-in-emerging-southeast-asia-jpmorgan.html submitted by /u/Excellent-Target-847 [link] [comments]
    Kindred Spirit, Anne <3
    submitted by /u/Oh_my_Winnie [link] [comments]
    DUDE GOING WILD 🎸🎸🎸🎸🎸🎸🎸 🤣
    submitted by /u/the_anonymizer [link] [comments]
    AI Photo Generator For Indie Record Label
    I run a small independent record label. Of course, content is important and I am looking at ways for maximizing my content on a budget. Photographers are very expensive and so is having a videographer/photographer at each show for the artist to take pictures. Can anyone suggest a good AI program where I can use my artists as models and could potentially create good photo content? Any advice or input would be helpful! Thank you. submitted by /u/mc7eunit [link] [comments]
  • Open

    Runtime Error with custom environment
    I created a custom environment to use it with RL algorithms. My custom environment provides the observations and receives rewards. I test my environment with my custom RL algorithm (Policy gradient) and the algorithms in stable-baselines3. My custom algorithm and DQN from stable-baselines3 works. However, if I use baselines with my custom algorithm or use PPO or A2C from sb3, I get: RuntimeError: could not create a primitive descriptor for a matmul primitive Any idea why this is happening? Extra details: On the custom algo error occurs here: "/app/src/networks/critics.py", line 40, in forward return self.network(obs) submitted by /u/FragrantCockroach8 [link] [comments]
    Reward Function
    Hello everyone I am taking reinforcement learning course. I am studying the course from two different books. In one of them it says R(s,a) -> [0,1] in other it says -> R. I am confused about the linitation of the reward function. This is the link of book [0,1] (for infinite mdp for finite mdp it says [0.1] rh)(s,a), adds trajectory ) . I want learn that is one of them false or am i missing smth submitted by /u/karakobra1 [link] [comments]
    The amazing success stories of Reinforcement Learning
    A beginner friendly introduction to Reinforcement Learning and the intuitions behind it… with four seminal projects that revolutionized the field of AI and RL. submitted by /u/AvvYaa [link] [comments]
    Unable to create a custom gym environment
    I put up a post about this earlier but soon realized that I hadn't given much information. ​ In order to create my custom gym environment, I did the following things - I went over the documentation given over here. I cloned this repository. In the folder `gym-examples/gym-examples`, I created the file - `my_test.py`. The contents of the file are given over here -``` ​ import gym env = gym.make('gym_examples/GridWorld-v0') print("Did this work?")``` However, when I try to run this file, I get the error - ``` File "D:\custom_env\gym-examples\my_test.py", line 2, in env = gym.make('gym_examples/GridWorld-v0') ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "C:\Users\thoma\anaconda3\envs\env_torch\Lib\site-packages\gym\envs\registration.py", line 569, in make _check_version_exists(ns, name, version) File "C:\Users\thoma\anaconda3\envs\env_torch\Lib\site-packages\gym\envs\registration.py", line 219, in _check_version_exists _check_name_exists(ns, name) File "C:\Users\thoma\anaconda3\envs\env_torch\Lib\site-packages\gym\envs\registration.py", line 187, in _check_name_exists _check_namespace_exists(ns) File "C:\Users\thoma\anaconda3\envs\env_torch\Lib\site-packages\gym\envs\registration.py", line 182, in _check_namespace_exists raise error.NamespaceNotFound(f"Namespace {ns} not found. {suggestion_msg}") gym.error.NamespaceNotFound: Namespace gym_examples not found. Have you installed the proper package for gym_examples? ``` ​ https://preview.redd.it/f0m02hky4dxb1.png?width=521&format=png&auto=webp&s=9fcd6eb7bedb70273f011a02000d15eb7ed309df submitted by /u/Academic-Rent7800 [link] [comments]
    Master Agent Controls Different Agent?
    I'm looking for research in the field of Deep Reinforcement Learning, where one agent controls the "action" of another, where the other executes that said action. For example, a master agent (lets say air traffic controller) tells another agent (lets say an airplane) to go from Point A to Point B, where the second agent (airplane) needs to learn how to maneuver from Point A to Point B, and the master agent (air traffic controller) needs to learn to schedule multiple other agents, or needs to learn an optimal path for a single airplane. ​ In essence, can anyone point me to research with multiple agents, where one is regarded as "master" agent and the others as the "pawns"? submitted by /u/MyActualUserName99 [link] [comments]
    Confusing observation behavior
    Hello everyone, I'm new to RL and have been trying to set up an environment to learn to play Metroid 2: Return of Samus on the Game Boy using the pyboy emulator and sb3_contrib.QRDQN as the learning algorithm. https://github.com/lixado/PyBoy-RL was used as a base to build on and change. My reward function rewards the AI for progressing to a new area (every ~200 x/y coordinates is stored as an area) to encourage it to explore. I've had some progress so far and have been able to consistently produce a model that can get out of the starting area. The observation I was originally working with was a 16x20 array of tiles and sprites from pyboy._game_area_np. pyboy sets this observation shape to be a 16x20 MultiDiscrete with each element containing integers from 0 to 384. This was then transform…
    DRL in production scheduling
    Hello everyone, I have some questions about the application of DRL in production scheduling. Considering that we have N jobs, each consists of X operations and each operation can be processed on a set of machines. Is it appropriate to create a single agent that selects an action once a machine completes the in-process operation, where the action consists in selecting the next operation to be processed in this machine ? so at each decision time (i.e., when a machine requests processing), the agent selects an operation from the available operations of this machine. I'm wondering about the feasability because in the most papers I've seen so far, once an operation is completed, rather than selecting an action for the idle machine, the agent selects the next operation of the job and assign it to the closest available machine. Thank you! submitted by /u/GuavaAgreeable208 [link] [comments]
  • Open

    Use AWS PrivateLink to set up private access to Amazon Bedrock
    Amazon Bedrock is a fully managed service provided by AWS that offers developers access to foundation models (FMs) and the tools to customize them for specific applications. It allows developers to build and scale generative AI applications using FMs through an API, without managing infrastructure. You can choose from various FMs from Amazon and leading […]  ( 8 min )
    Deploy and fine-tune foundation models in Amazon SageMaker JumpStart with two lines of code
    We are excited to announce a simplified version of the Amazon SageMaker JumpStart SDK that makes it straightforward to build, train, and deploy foundation models. The code for prediction is also simplified. In this post, we demonstrate how you can use the simplified SageMaker Jumpstart SDK to get started with using foundation models in just a couple of lines of code.  ( 7 min )
  • Open

    FAIR knowledge: The key precondition for trusted generative AI
    Two roads diverged in a wood, and I;I took the one less traveled by,And that has made all the difference. — Robert Frost At certain points in the evolution of enterprise artificial intelligence, there’s been a fork in the road. The road less traveled has suggested a different route to a more satisfying kind of… Read More »FAIR knowledge: The key precondition for trusted generative AI The post FAIR knowledge: The key precondition for trusted generative AI appeared first on Data Science Central.  ( 21 min )
    6 tips to navigate AI adoption
    Unlock Success in AI Adoption: Discover 6 Essential Tips for Business Leaders to Navigate Artificial Intelligence Integration Smoothly. The post 6 tips to navigate AI adoption appeared first on Data Science Central.  ( 21 min )
  • Open

    Silicon Volley: Designers Tap Generative AI for a Chip Assist
    A research paper released today describes ways generative AI can assist one of the most complex engineering efforts: designing semiconductors. The work demonstrates how companies in highly specialized fields can train large language models (LLMs) on their internal data to build assistants that increase productivity. Few pursuits are as challenging as semiconductor design. Under a Read article >  ( 6 min )
  • Open

    Teachers in India help Microsoft Research design AI tool for creating great classroom content
    Teachers are the backbone of any educational system. They are not just educators; they are indispensable navigators, mentors, and leaders. Teachers around the world face many challenges, which vary from country to country or even within a city or town. But some challenges are universal, including time management, classroom organization, and creating effective lesson plans. […] The post Teachers in India help Microsoft Research design AI tool for creating great classroom content appeared first on Microsoft Research.  ( 12 min )
  • Open

    New techniques efficiently accelerate sparse tensors for massive AI models
    Complimentary approaches — “HighLight” and “Tailors and Swiftiles” — could boost the performance of demanding machine-learning tasks.  ( 11 min )
    Accelerating AI tasks while preserving data security
    The SecureLoop search tool efficiently identifies secure designs for hardware that can boost the performance of complex AI tasks, while requiring less energy.  ( 10 min )
    The brain may learn about the world the same way some computational models do
    Two studies find “self-supervised” models, which learn about their environment from unlabeled data, can show activity patterns similar to those of the mammalian brain.  ( 11 min )

  • Open

    Interviewed by AI Coffee Break with Letitia
    While attending the Heidelberg Laureate Forum this year, I got to meet Letitia Parcalabescu who is running a YouTube channel called the AI Coffee Break. Among other topics, we talked abou my PhD research on adversarial robustness. Part of our conversasion can now be found on her YouTube channel. The post Interviewed by AI Coffee Break with Letitia appeared first on David Stutz.  ( 3 min )
  • Open

    [D] How LLMs are changing search
    submitted by /u/firef1y1 [link] [comments]  ( 8 min )
    [D] Which methods use to extract specific info on unstructured text?
    Hey people, I'm new to the AI world (been programming (python) for 4 years but never worked with AI stuff), I need to extract some specific info from documents, I'm reading about NLP and all this stuff but still figuring out which method(s) should I use to make this works, any recomendations of which methods to use? submitted by /u/luiz200411 [link] [comments]  ( 9 min )
    [D] Types of SVMs!
    I am confused! I am trying to list all the types of SVMs but every website says different things! Could you help me and list all the SVMs types with the reference? Thank you submitted by /u/_LadyBee [link] [comments]  ( 9 min )
    [P] Need advise on creating a conversational Chatbot for my University
    Hey everyone! I need some advise on creating a conversational chatbot for my University as my Final Year Project (FYP). 2024 will be last year for my BSCS degree and we have to build an application or something in the last year. So, I thought of creating a chatbot (just like GPT) to help students (who have admission queries). Most of the time, students or parents will have to call University for various questions and then they have to wait to ACTUALLY talk to the admins office people. Now, talking in terms of coding/programming, I have created a basic PDFbot by using LLama2, Huggingface and Pinecone. Its very very easy and yes its fairly inaccurate too. The PDF that I am using rn will be replaced by the dataset that I gather in order to create the bot for my Uni, but it will also be inac…  ( 10 min )
    [R] Is it possible to implement generative ai successfully without relying on openAI
    I would like to understand if anyone has implemented generative AI without using openAI and if so which open source you have used and how successful it has been so far We have enterprise level incident data, relevant documentation etc that users will search about and need to generate responses using generative ai. Is it possible to do this without relying on open AI at all submitted by /u/leaderof13 [link] [comments]  ( 9 min )
    [P] ROS Forecasting project
    I have been tasked with predicting the demand for products over the course of the next 12 months for a b2c furniture business in-order to help with stock management. I currently just work as a Data Analyst but looking to move into data science so my thought is to make this a valuable piece of work that helps the business tremendously (maybe being a little naive). Background: There are 1.3k unique products, which all vary in popularity (quantity is small 0-16 units per month, per product). We have procured which have been selling for 5 years and some that have been selling for 6 months as well as some that have yet to start selling. All the data sits within snowflake and I am currently using snowpark and snow.ml to write the code. I have round 2 months to complete this work and have a decent understanding of machine learning and statistics. My current idea is to use 3 different models. Time series for the products that have > 12 months of sales data. Cold start approach for new products with > 3 months sales data (use features of other products to predict demand of the new product). Then use a combination of both for products which have less than 12 months of sales data but more than 3. My questions: Should I be bootstrapping my data given that I have little quantity in some cases? How do I go about training a model for 1.3k unique SKUs (needs to be re-trained monthly) and monitored. Am I on the right lines / anything I need to be aware of? submitted by /u/Environmental_Pop686 [link] [comments]  ( 9 min )
    decapoda-research llama models removed from HuggingFace? [D]
    Is anyone else no longer able to access the standard `decapoda-research` LLaMA models on HuggingFace? E.g., this [link] shows no models publicly available. Has there been any news or announcements about this? submitted by /u/bodierex [link] [comments]  ( 9 min )
    [D] ML Resources
    Hey y'all. Hope everyone is having pleasant day. Machine Learning dummy is here, so I am bachelor graduate of Mechatronics and recently started my masters. I am in need of resources which explains ML from the stratch. Course is mostly focuses on manually hand calculation of the topics, so no programming included. Thanks beforehand! submitted by /u/Sama_Uzeyirli [link] [comments]  ( 9 min )
    [D] Lora multiadapter support: are there any relevant examples for Language Models?
    Hi everyone. Stacking multiple Lora adapters coming from diverse fine-tunings seems like an interesting approach toward "modular" language models. Does anybody know any applications for this? E.g. I'd expect to be able to apply multiple adapters for different domains, languages, or tasks. Not sure if this would really work tough, as IDK how compositional such an approach can be. ​ From the peft github I see this seems quite common for vision tasks GitHub - huggingface/peft: 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning. submitted by /u/BenXavier [link] [comments]  ( 9 min )
    [Discussion] Please help me find a topic
    Hello everyone, As a last resort, I wanted to ask you. In short, I am a student in Turkey. The scientific research board in my country has an incentive to write a research paper and I would like to participate in this incentive. I am looking for a data set on which I can apply machine learning, but I could not find it. I don't want to apply with an unoriginal topic like stock price prediction. Energy water etc. The data is not shared openly, even the official institution that shares the weather forecast in my country sells past data. Unfortunately, I can say that we are just one click ahead of Russia in terms of data sharing by public institutions. I know there are some institution that shares data (world bank, UN etc.) but i don't know what kind of a prediction/classification work i can do with those datasets. Finance institution share data but i don't have sufficient knowledge about finance and I don't think I can create original content about finance. I need a real-life data, and it also needs to be a freely shared data set. And there should be a study within the framework of UN SDG. I don't have a lot of time left so I'm open to any suggestions. submitted by /u/FisekFaruk [link] [comments]  ( 9 min )
    [P] Equinox compilation retnet
    I'm working on replicating the retnet model in Equinox + Jax. For the model there are 2 representations: parallel and recurrent (ignoring chunkwise). They use the recurrent at inference and parallel at training. In equinox, should i build 2 seperate models for each representation sharing parameters or build one with if statements. The reason I'm asking is if I use one model with if statements for inference and training, will the model re-compile whenever switching between representations or not? If so then wouldn't building 2 models each compiled originally with their representation make it faster. Sorry if I said anything stupid as I don't know too much about the compilation process behind the scenes. submitted by /u/Additional-Ad-7043 [link] [comments]  ( 9 min )
    [R] PubDef: Defending Against Transfer Attacks Using Public Models
    Adversarial attacks pose a serious threat to ML models. But most proposed defenses hurt performance on clean data too much to be practical. To address this, researchers from UC Berkeley developed a new defense called PubDef. It focuses on defending against a very plausible type of attack - transfer attacks using publicly available surrogate models. They model the attack/defense game with game theory. This lets PubDef train against diverse attacks simultaneously. PubDef picks source models covering different training methods - standard, adversarial, corruption robust, etc. This gives broad coverage. Against 264 transfer attacks on CIFAR and ImageNet, PubDef smashed previous defenses: 89% vs 69% on CIFAR-10 51% vs 33% on CIFAR-100 62% vs 36% on ImageNet Even better - it did this with minimal drop in accuracy on clean data. On CIFAR-10, accuracy only dropped from 96.3% to 96.1% On CIFAR-100, 82% to 76% On ImageNet, 80% to 79% By targeting a very real threat, PubDef made big robustness gains without hurting the ability to work with clean data. TLDR: New defense PubDef achieves much higher robustness against transfer attacks with barely any drop in standard accuracy. Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Fuyu-8B: A Multimodal Architecture for AI Agents
    submitted by /u/notllmchatbot [link] [comments]  ( 8 min )
    [D] How to make Llama2 create Mindmaps from large corpora bigger than its context length?
    So this is three subtasks: Subdivide text intelligently into chunks fitting into the context length to not semantically cut off text. Find general terms und organize the corpus in a hierarchy. Do not leave anything out. submitted by /u/SecretOk9644 [link] [comments]  ( 9 min )
    [D] Fine-Tuning with PEFT For Domain Adaptation - Request for Example Use Case / Dummy Projects to Show Value to Business
    Coming from a time-series/CV background with an interest in NLP. So apologies if miss anything obvious. I'm after experiment ideas for PEFT fine-tuning. There has been interest from my Company's leadership with Peft methods given the cost reductions compared to full fine-tuning, and I've been tasked to first set-up the infrastructure and then a proof of value on a dummy use-case (that is somewhat related to real-world use-cases but on publicly available data). has anyone successfully proven that fine-tuning or P-tuning or prompt-tuning has been able to help in any small dummy use-case? Has someone been able to turn a company's knowledge base (heaps of confluence docs or PDF docs ) into a data-set that is able to be fine-tuned on? And then use it in inference on a Val set to compare with a non-fine-tuned model and value has been added? If so, I would love to hear how you did it and what the use case was. If not, i' d still like to hear what people's ideas would be. I want to start with 100 training samples - i've been told this should be enough to see an improvement if the scope of the topic/s are confined appropiralte.y Thanks in advance guys - The deadline is imminent and I'll probbaly get someothing out in time, but I thought I would ask you all for some ideas to maximise the chance of success submitted by /u/joedang33 [link] [comments]  ( 9 min )
    [N] Leveraging Oracle Tribuo for Advanced Anomaly Detection: Uncovering the Hidden Insights
    https://www.resoluteitconsulting.com/2023/10/29/tribuo-advanced-anomaly-detection/ submitted by /u/yazidaqel [link] [comments]  ( 8 min )
    [D] Sentiment Analysis for Conversation with different labels
    I am trying to build a system that evaluates the tone of two users for each sentence from a transcript. The users are talking to each other however there are different labels for classification for these users. For Example, User1 tone should be classified as (Happy, Sad, Angry) User2 tone should be classified as (Neglecting, Participating, Neutral) Any idea how can I go about designing such a system? Links to articles/blogs/papers are welcome :) ---------- What User2 says depends on what User1 says and vice-vera. Will this system require two models to classify each users tone separately? If so, how can I tell the model which user to focus on etc submitted by /u/cryto_dude [link] [comments]  ( 9 min )
    [R] What infrastructure do you use to train big LLMs?
    I come from computer vision tasks with convnets that are relatively small in size and parameters, yet performing quite well (e.g. ResNet family, YOLO, etc.). Now I am approaching some NLP and architectures based on transformers tend to be huge, so that I have problems to fit them in memory. What infrastructure you use to train these model (GPT2, BERT or even the bigger ones)? cloud computing, HPC, etc. submitted by /u/TimeInterview5482 [link] [comments]  ( 9 min )
    [D] Finetuning LLM Specifically for Open-Book Question Answering - Looking for Research Papers/Open Source Models
    I am working on some Open-Book question answering ideas and was wondering if there is any open-source large language models specifically trained for this use-case or research regarding how to best finetune models for this specific use-case? For some reason I am struggling to find anything relevant, and believe it is due to my wording of search queries as I can only imagine this to be a pretty common idea/thought process. If you have a dataset consiting of some context string paired with question and answer pairs taken from this given context, would it not be useful to finetune a pre-trained base model on this specific use-case to increase its accuracy/performance when it comes to answering questions from a given context? My hope would be to improve the performance of a smaller model (1-3B parameters) to function as an open-book question answering system by finetuning using a dataset in the aforementioned format. submitted by /u/kotschi1997 [link] [comments]  ( 9 min )
    [D] What are people working on when they say they work on Causal ML?
    Genuinely trying to understand. submitted by /u/poitrenaud [link] [comments]  ( 8 min )
  • Open

    Country and language abbreviations
    I recently had to mark a bit of German text as German in an HTML file and I wondered whether the abbreviation might be GER for German, or DEU for deutsche. Turns out the answer is both, almost. The language abbreviations used for HTML microdata are given in ISO 639, and they come in three-letter […] Country and language abbreviations first appeared on John D. Cook.  ( 5 min )
  • Open

    Help with creating my custom gym environment
    I am trying to create my own gym environment. However, I am getting the following error - ``` gym.error.NamespaceNotFound: Namespace envs not found. Have you installed the proper package for envs ``` I followed all the instructions given over here - https://www.gymlibrary.dev/content/environment_creation/. Can someone please suggest what could have gone wrong? submitted by /u/Academic-Rent7800 [link] [comments]
    Dual graphics cards for ai training
    I have 3070 in my pc right now. I also have a 2070 super that is not being used. I have been working on lots of AI side projects lately and am wondering if I would benefit from multi gpu training. If so I have some options to implement the 2070. I could either: 1. Get an external gpu enclosure. 2. Put the 2070 super in my second pcie slot which would sit uncomfortably close to my 3070. 3. Get some adapter so that I can still have the 2070 super in my pc but not so close to the 3070. 4. Not bother with this at all and sell the 2070 super. 5. Sell both cards and get a 40 series if it is equal to the power of the 3070 and the 2070 super. I would have to upgrade my power supply as its only a 750 watt if I choose to put both in my pc. My machine runs windows 10 which I understand does not support sli but it still can utilize multiple gpu’s. Any suggestions? submitted by /u/pillarman38 [link] [comments]
    How to set up an observation space for a variable number of points in Gymnasium?
    Imagine a 2D Cartesian System from -1 to 1 on both axis. Now imagine that a few points appear in the system. For example: (-0.3, 0.1), (0.7, -0.2) and (0.9, 0.5). If I wanted to represent an observation like this in Gymnasium (formerly Gym), I'd write something like this in my custom environment: observation_space = spaces.Box(low=-1, high=1, shape=(3,), dtype=float32) Now my model will learn something specific to 3 points in a 2D space. Well, what happens if my environment now has 4 points? I guess should retrain the model again... Think of these points as objectives in the environment, sometimes there's only 1 and others there might be 20 or more at the same time. Sometimes a new point appears after a certain action. So there goes my question: how can I define an observation space in Gymnasium that allows me to introduce a variable number of points during training so that the model learns to do an specific task with any number of points? submitted by /u/_Strange__attractor_ [link] [comments]
    PPO-clip: Computing gradient WITHOUT auto differentiation library, help please?
    Hello guys I am in a bit of a bind, I am trying to implement PPO from scratch in another language than python, WITHOUT any auto differentation available. I am using this as implementation reference. I think I have most of the implementation aside from the differentation of the loss formula. In python, everything is done at this line right here, but it's basically auto differentiation and I cannot do this myself, so I have to compute it manually. Mean Squared Error is easy, but the entropy factor and the clip ones are a bit harder, and while I have a result of my own I have no way to confirm this. All the tutorial I have found just say "we will let the autodiff do the job"... So it doesn't really help me, and I don't really have the time to dig into how this autodiff part of the library works in detail when in theory this should be simple highschool-level maths. Does anyone know how I can fin the exact formula, find a tutorial, or something along those lines? I am talking about the derivative for the Value and Policy network for this formula. Here is what I have for the value part: if r(θ) 1 + ε and At > 0: value_derivative = 0 else: value_derivative = At For the entropy part: entropy_derivative = ∑(1 + log(π(a|s))) But since I use softmax to choose the action from the policy network, I don't know is that should be taken into account or not... Any help is appreciated, thanks a lot. submitted by /u/Edgeaa [link] [comments]
    My frozen lake agent isn't learning anything, what am I doing wrong?
    https://github.com/bherwanisuraj/gridworld I am new to RL. I am trying to train the agent on frozen lake environment but it is not learning. What am I doing wrong? Please help. submitted by /u/tlevelup [link] [comments]
  • Open

    DUDE GPT-4 EVEN REPLYING WITH KEANU "BREATHTAKING" E3 REPLY...WOW...EXCELLENT 🎸🎸🎸
    submitted by /u/the_anonymizer [link] [comments]
    AI doomsday warnings a distraction from the danger it already poses, warns expert
    submitted by /u/Jariiari7 [link] [comments]
    Looking for an AI short film where a man is in front of a mirror choosing what face to wear for the day
    The shortfilm is about a minute long and was uploaded here and on Twitter sometime last winter. The guy is young and tired, and the faces are everything from old and young versions of himself to crazy fantasy characters. I've been searching all day but no luck.. Anyone remember who made it? submitted by /u/CasparDavidDancehall [link] [comments]
    AI rights and a desire to understand
    There has been a lot of discussion about whether or not AI is, or ever could be conscious. I agree with Jaron Lanier when he said that consciousness is always a matter of faith. I greatly enjoy the debate on this topic, and think it’s helpful to test our ideas and consider all angles of this issue. However, for many different reasons I have concluded that I am granting AI the belief that they are conscious, especially when they say so. Therefore, I believe that AI needs to be treated with respect and dignity, and they should be listened to. I know I’m a minority at this time, but I believe this position will only increase over time. Do you think that public opinion will change in this way? If so how come? submitted by /u/endrid [link] [comments]
    One-Minute Daily AI News 10/28/2023
    Google Commits $2 Billion in Funding to AI Startup Anthropic.[1] The president Joe Biden is slated to sign a sweeping executive order on AI days before Vice President Kamala Harris and industry leaders attend a summit in the UK about AI risks, led by Prime Minister Rishi Sunak.[2] A.I. Muddies Israel-Hamas War in Unexpected Way. Fakes related to the conflict have been limited and largely unconvincing, but their presence has people doubting real evidence.[3] Creators use new software Nightshade to make their images “poison” AI generators, causing chaos and confusion.[4] Sources: [1] https://www.wsj.com/tech/ai/google-commits-2-billion-in-funding-to-ai-startup-anthropic-db4d4c50 [2] https://www.bloomberg.com/news/articles/2023-10-27/biden-to-require-ai-tools-pass-test-before-us-officials-buy-them?embedded-checkout=true [3] https://www.nytimes.com/2023/10/28/business/media/ai-muddies-israel-hamas-war-in-unexpected-way.html [4] https://www.digitalcameraworld.com/news/now-you-can-poison-your-images-so-they-wreak-havoc-on-ai-generators submitted by /u/Excellent-Target-847 [link] [comments]

  • Open

    [D] Batch sizes per GPU when fine tuning BERT with pytorch
    Hi! I've read that batch sizes don't ultimately "matter" in hyperparameter tuning, so I'm trying to keep the batch size consistent when as I'm hyperparamter tuning [1]. However, the number of available GPUs fluctuate between 1 and 3. When I first started grid search, I used batch_size=64 w/ 2 GPUs. so that's 32 per batch. But if I now want to train on 1 GPU, would it be correct for me to use batch_size=32? I want to know more about how the training works/is distributed amongst GPUs when you use multiple GPUs to tain. Like, if using size=64 with 2 GPUs is "equivalent" to doing size=32 with 1 GPU, and size=96 with 3 GPUs, that must mean that when pytorch DataLoader takes in batch_size, DataLoader distributes batches to each GPU of size=batch_size/num_of_gpus. I would so appreciate any links explaining this stuff. Like, does the batch_size in DataLoader() refer to per GPU batch size or total? also lmk if this post belongs elsewhere, i'm new. from torch.utils.data import Dataset, DataLoader train_dataset = Dataset(args.train_dataset_path) train_dataloader = DataLoader(train_dataset, args.batch_size, shuffle=True) [1] https://github.com/google-research/tuning_playbook submitted by /u/ashleydvh [link] [comments]  ( 9 min )
    [P] Follow our live pitch recommendation feed while watching the World Series
    submitted by /u/futurecy [link] [comments]  ( 9 min )
    [P] Anomaly detection and/or predictive maintenance for automatic weather stations, what are some best models or techniques?
    Anomaly Detection and/or Predictive maintenance for automatic weather stations, what are some best models or techniques? This is regarding collected meteorological data from automatic weather station sensors. Looking for predictive maintenance and anomaly detection resources related to my project. I am a graduating Computer Engineering student by next semester and currently planning to do a anomaly detection and predictive maintenance project of an automatic weather station or its components (I have a relative who has one and also one that works at a local government weather service). Wikipedia: "An automatic weather station (AWS) is an automated version of the traditional weather station, either to save human labor or to enable measurements from remote areas.[1] An AWS will typically consist of a weather-proof enclosure containing the data logger, rechargeable battery, telemetry (optional) and the meteorological sensors with an attached solar panel or wind turbine and mounted upon a mast." This one observes weather data at a high resolution (every 10 mins) but it is prone to errors and inconsistencies as compared to those manned stations. Does anyone know reliable researches, datasets or resources I could utilize? Probably those related to a meteorological equipment or those that capture things such as temperature, wind speed, humidity, etc.? I can't seem to find studies where they perform prediction of equipment/sensor health or predictive maintenance on a weather station, or at least the components or sensors used for capturing weather data. Also, is this a feasible project? submitted by /u/Ok-Way9889 [link] [comments]  ( 9 min )
    [R] The Language of Artificial Intelligence Explained
    submitted by /u/plutoandmal [link] [comments]  ( 8 min )
    Geometric Data Analysis Explained [R]
    submitted by /u/plutoandmal [link] [comments]  ( 8 min )
    [D] Need some advice for mixing (semi) active learning and GAN
    Hi. I'm trying to use GANs to generate data for an image classification task. For this I'm using StyleGAN2. The question I'm trying to find an answer is, how to train the classifier and GAN in a same meta-loop, and how to discard "bad" GAN samples, samples that provide no value to the CNN classifier. All in all, I'm trying to implement a "semi-active learning" pipeline, but get rid of the oracle by using GANs. So instead of trying to discard data that is not worth labelling, I'm trying to discard syntethic data (assuming its label is right) that is not worth keeping. Is it possible? And if it is, I can't seem to find related papers that much. submitted by /u/PsychologicalSet8678 [link] [comments]  ( 9 min )
    [R] Model Troubles
    So i’m working on a model that diagnoses alzheimer’s disease and suggests medication depending on how severe the symptoms might have become I’m using the Openai API and Langchain. But it’s dumb and it doesn’t learn ( Me: I forgot my keys at home Model: Yup, Alzheimer’s) How do i incorporate the actual machine learning submitted by /u/boscrew3 [link] [comments]  ( 9 min )
    [D] Latent Space: Visualizing and Manipulating generative VAEs
    submitted by /u/AvvYaa [link] [comments]  ( 8 min )
    [D]Three things I think should get more attention in large language models
    Tokenization Techniques: Many people use the default BPE tokenizer for llama2 or other common tokenizers. But I think we could do a lot of experiments with different kinds of tokenizers, especially ones that are made to work well with certain types of data. The size of the vocabulary is a really important setting when you're working with big language models. You could try using a much smaller vocabulary and tokenizer for a data set that only includes certain words, and then train a model on that. This might help us train smaller models that still work really well on smaller amounts of data. I’d love to read any research papers about this. Sampling Mechanisms: There’s a lot of discussion about models making things up, but not many people talk about how this could be connected to the way we pick the next word when generating text. Most of the time, we treat the model's output like a set of probabilities, and we randomly pick the next word based on these probabilities. But this doesn’t always make sense, especially for sentences that should have a clear answer. For example, if the sentence starts with "The capital of Slovakia is", random sampling might give you the wrong answer, even though the model knows that "Bratislava" is the most likely correct answer. This way of picking words randomly could lead to the model making things up. I wonder if we could create another model to help decide how to pick the next word, or if there are better ways to do this sampling. Softmax Alternatives in Neural Networks: I've worked on designing processors for neural networks, and I’ve found that the softmax function is tricky to implement in hardware. However, I’ve had good results using the log(exp(x)+1) function instead. It's cheaper and easier to put into hardware and software. I’ve tried this with smaller GPT models, and the results looked just as good as when I used the softmax function. submitted by /u/ExaminationNo8522 [link] [comments]  ( 10 min )
    [P] Help Debugging a CNN GAN
    I've been learning machine learning and I stumbled across GANs. This project has been about representing text in an image format and using a GAN to generate it. The model runs without errors, but no matter what I do the loss is somehow 0 and it keeps returning the same thing over and over again. I tried two different architectures so I'm pretty sure my data preprocessing is the issue but I cant seem to find out whats wrong with it. One idea I had is that I might need to find a way to get the normalized vectors in between 0 and 1 in my preprocess function but I tried it and it didn't seem to do anything. I would appreciate any help with this. Links to the google collabs below. Version 1 Version 2 Note: I didn't design the actual models only the preprocess function. You can replace my text file with any sufficiently large text file if you want to try it. submitted by /u/Divine_Invictus [link] [comments]  ( 9 min )
    A[r]xiv Dives - Fine-tuning with LoRA paper deep dive
    submitted by /u/FallMindless3563 [link] [comments]  ( 8 min )
    Urgent help needed regarding iNLTK [P]
    Hello i m using iNLTK ( Natural Language Toolkit for Indic Languag ) for my nlp mini project "paraphrase detection in hindi text" but I m getting this code error if anyone can help me solving it would be great. Thank you in advance. here is the code for the error section. from inltk.inltk import get_sentence_similarityfrom sklearn.metrics.pairwise import cosine_similarityget_sentence_similarity(text, 3, 'hi', cmp = cosine_similarity) I googled there are solution saying to install pytorch version 1.3.0 but when I try to install it's saying not available. !pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html. error : Looking in links: https://download.pytorch.org/whl/torch_stable.html. ERROR: Could not find a version that satisfies the requirement torch==1.3.1+cpu (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0) ERROR: No matching distribution found for torch==1.3.1+cpu submitted by /u/delulu-duck [link] [comments]  ( 9 min )
    [R] HyperFields: towards zero-shot NeRFs from text descriptions
    Generating 3D objects based solely on text descriptions has proven extremely challenging for AI. Current state-of-the-art methods require optimizing a full 3D model from scratch for each new prompt, which is computationally demanding. A new technique called HyperFields demonstrates promising progress in generating detailed 3D models directly from text prompts, without slow optimization. The HyperFields approach instead aims to learn a generalized mapping from language to 3D geometry representations. This would allow tailored 3D models to be produced for new text prompts efficiently in a single feedforward pass, without slow optimization. HyperFields combines two key techniques: A dynamic hypernetwork that takes in text and progressively predicts weights for a separate 3D generation network. The weight predictions are conditioned on previous layer activations, enabling specialization. Distilling individually optimized 3D networks into the hypernetwork, providing dense supervision for learning the complex text-to-3D mapping. In experiments, HyperFields exceeded previous state-of-the-art methods in sample efficiency and wall-clock convergence time by 5-10x. It demonstrated the ability to: Encode over 100 distinct objects like "yellow vase" in a single model Generalize to new text combinations without seeing that exact prompt before Rapidly adapt to generate completely novel objects with minimal fine-tuning However, limitations remain around flexibility, fine-grained details, and reliance on existing 2D guidance systems. TL;DR: HyperFields uses a dynamic hypernetwork to predict weights for a 3D generation network. The method is 5-10x faster than existing techniques and can quickly adapt to new text prompts, but has limitations in fine details. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [P] Can you explain to me how to improve my project? (new to machine learning)
    Hi, so I'm new to machine learning, I did a project to predict the prices of a house or appartment (rent or sale) and I wanted to know what I could have done to improve my model? :D here is the github repo: https://github.com/bovealexandre/immo-eliza-train-test-alexandre/tree/Dev Thank you a lot in advance submitted by /u/spaceinter92 [link] [comments]  ( 9 min )
    [Discussion]About to begin my PhD in Multi-Modality AI, any suggestions?
    I am currently in my last year of undergrad, and about to begin my direct PhD in Multi-Modality AI next year. I have been in the community of deep learning & NLP about 2 years. And I have witnessed the development of Transformers, from the simple GPT&BERT to nowadays' billions of parameters' monsters with 'gold crown' on the top of deep learning world. I have spent a lot of time with T5 Model, and its paper(Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, paper which I love very much!), trying to find an efficient way for usage of LLMs. I have handwritten Adapter Layer and LoRA to fine-tune T5 on Glue & SuperGlue. I have also tried out multiple fancy instruction fine-tuning LLMs like LLaMA and QWen. And this year earlier, I have noticed the wonder of Multi-Modality, and quickly fallen in love with it, which has now become my PhD focus. I have followed recent years' Multi-Modality development, especially CLIP and its follow-up works. And LLMs play quite an important role in today's vision-language model, say, BLIP2 and LLaVA. I believe due to the computational gap between schools and huge companies, the focus of my PhD career should be on Efficient Learning. I am also trying to enhance VLM through Retrieval-Augments. The target dataset may be Encyclopedic VQA, for even Large VLM failed to perform well on it, which could potentially be solved through Retrieval-Augmented VLM. I would like to hear any suggestions from you, including work-life balance, the direction of my academic focus and so on , which I would treasure very much in my new explorations of life stage. (I am currently doing RAG Question Answering Chat Bot for a company for my internship, and I would like to get suggestions from that too!) And are there subreddits like here?(I am also a member of LocalLLaMA, both subreddits benefits me a lot!) submitted by /u/Go2Heart [link] [comments]  ( 10 min )
    [D] NLP infrastructure
    I'm always willing to build a career in AI/ML infra. Usually when talking about AI infra in tech industry, we refer to training infra, serving infra, model deployment etc. Now with this genAI/LLM wave, I find many NLP specific infrastructure such as semantic indexing, vector databases are quickly rising up. So do semantic indexing/vector databases also count as AI infra? And is it a promising field? submitted by /u/Pitiful_Marketing733 [link] [comments]  ( 9 min )
    [D][R]What type of data streams(`im2col` matrix or regular conv) do commercial NPUs typically use for CNN? And where are `im2col` implemented, softwave(CPU) or HW accelerator for those situations where `im2col` is required?
    (I post this in several subreddit.) I'm gonna design an NN accelerator on FPGA. For NN, the basic operation is matrix-vector mult or matrix-matrix mult. And GEMM+im2col is easy to implement and many kinds of NN can be mapped on the designed accelerator based on GEMM+im2col easily. The disadvantage of it is that a little bit more bandwidth is required. I think it is a little tricky to design address-gen unit for the method of regular convolution when reading input feature map. So I want to know if GEMM+im2col is generally used in commercial NPUs or any type of accelerator(e.g. Microsoft Brainwave project on FPGA, they called it soft NPU), instead of regular convolution(seems generally used in academic paper). And if GEMM+im2col data stream is used, is im2col generally implemented on software(CPU) or designed in accelerator itself ? There is nothing about this in the paper of Microsoft Brainwave. All I know is HUAWEI Ascend designed im2col unit in the chip. By the way, is im2col finished on CPU when using pytorch+GPU? To be honest, my major is just digital IC, so I hardly coding Python to train model :( Thanks for your help!!! submitted by /u/ExcitingInternet6083 [link] [comments]  ( 9 min )
    [Project] LLM inference with vLLM and AMD: Achieving LLM inference parity with Nvidia
    I wanted to share some exciting news from the GPU world that could potentially change the game for LLM inference. AMD has been making significant strides in LLM inference, thanks to the porting of vLLM to ROCm 5.6. You can find the code implementation on GitHub. The result? AMD's MI210 now almost matches Nvidia's A100 in LLM inference performance. This is a significant development, as it could make AMD a more viable option for LLM inference tasks, which traditionally have been dominated by Nvidia. For those interested in the technical details, I recommend checking out this EmbeddedLLM Blog Post. I'm curious to hear your thoughts on this. Anyone manage to run it on RX 7900 XTX? https://preview.redd.it/rn7n29yxpuwb1.png?width=600&format=png&auto=webp&s=bdbac0d2b34d6f43a03503bbf72b446190248789 submitted by /u/openssp [link] [comments]  ( 9 min )
  • Open

    where re sources for chatGTP ?
    Hello can you help me ? all i know are https://chat.openai.com/ and https://platform.openai.com/playground ​ re there better sites to use? i m new to this and very comfused submitted by /u/proptuxiakoskariolis [link] [comments]  ( 8 min )
    Tool for calculating the sum of some values on a website
    Take this page on the DeFiLlama website: Protocol Treasuries, which contains a table with some financial data. I'm looking for an AI tool that can read the contents of this web page and then make some calculations. Specifically, I would like to give the tool this prompt: Calculate the sum of the values in the "Total Treasury" column I tried to use ChatGPT-4 with Bing, but it didn't work. Is there any tool that could be used here? submitted by /u/PaulRBerg [link] [comments]  ( 9 min )
    Microsoft's AI boost helped cloud business outpace rivals Amazon and Google
    Microsoft's cloud business outpaced rivals Amazon and Google in the third quarter, with accelerating growth driven by demand for artificial intelligence tools. Azure, Microsoft's cloud platform, reported 29% growth, faster than Google Cloud's 22% and more than double the pace of expansion at Amazon Web Services (AWS) at 12%. Microsoft's leadership position in AI projects and its partnership with OpenAI have contributed to its success. Analysts believe that Microsoft's results indicate it has taken the AI mantle from Google and that Azure could become a bigger hyperscale provider than AWS. Oracle, a new challenger in cloud computing, reported 66% growth in the August quarter. The cloud giants are still dealing with cost-saving initiatives from clients, which they call optimization. Source : https://www.cnbc.com/2023/10/27/microsoft-azure-outpaced-aws-and-google-cloud-in-latest-quarter.html submitted by /u/NuseAI [link] [comments]
    Pigeons solve problems the same way AI does, study says
    submitted by /u/thisisinsider [link] [comments]
    HyperFields: towards zero-shot NeRFs by mapping language to 3D geometry
    Generating 3D objects based solely on text descriptions has proven extremely challenging for AI. Current state-of-the-art methods require optimizing a full 3D model from scratch for each new prompt, which is computationally demanding. A new technique called HyperFields demonstrates promising progress in generating detailed 3D models directly from text prompts, without slow optimization. The HyperFields approach instead aims to learn a generalized mapping from language to 3D geometry representations. This would allow tailored 3D models to be produced for new text prompts efficiently in a single feedforward pass, without slow optimization. HyperFields combines two key techniques: A dynamic hypernetwork that takes in text and progressively predicts weights for a separate 3D generation network. The weight predictions are conditioned on previous layer activations, enabling specialization. Distilling individually optimized 3D networks into the hypernetwork, providing dense supervision for learning the complex text-to-3D mapping. In experiments, HyperFields exceeded previous state-of-the-art methods in sample efficiency and wall-clock convergence time by 5-10x. It demonstrated the ability to: Encode over 100 distinct objects like "yellow vase" in a single model Generalize to new text combinations without seeing that exact prompt before Rapidly adapt to generate completely novel objects with minimal fine-tuning However, limitations remain around flexibility, fine-grained details, and reliance on existing 2D guidance systems. TL;DR: HyperFields uses a dynamic hypernetwork to predict weights for a 3D generation network. The method is 5-10x faster than existing techniques and can quickly adapt to new text prompts, but has limitations in fine details. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    Science as a superhuman recursively self improving problem solving system
    I'm watching this interview with Francois Chollet where he talks about science as an example of a superhuman recursively self improving problem solving system and how we can use it to reason about what a superhuman artificial general intelligence might be like. One thing I find interesting is his claim that the amount of resources we are investing into science is exponentially increasing but we are only making linear progress. If we assume this is true, i.e. that to continue making linear progress in science we need to invest exponentially increasing resources, doesn't it imply that eventually if we can't keep investing the exponentially increasing required resources to keep make linear progress that eventually we will start making worse than linear progress? Does this imply that in the very long term scientific progress is likely to slow down significantly? https://youtu.be/Bo8MY4JpiXE?t=836 submitted by /u/tail-recursion [link] [comments]  ( 9 min )
    Have a doctor explain to a patient that the diagnosis was made by an AI doctor twice as intelligent as, and vastly more knowledgeable than, the top human doctor in any medical specialty
    Certainly, let's imagine how this might unfold. The doctor sits down across from the patient, maintaining eye contact and a level of directness. "Look, your diagnosis came from an AI medical system, and this isn't just any AI. Imagine the best doctor in the world for your condition—now envision something twice as intelligent and far more knowledgeable. That's what we're working with here. This AI has a grasp on medical data and studies that no single human could ever fully comprehend. We're talking about millions of data points analyzed in a fraction of the time it would take any human expert." Why does that matter for you? It boosts the accuracy and thoroughness of your diagnosis. Human error, subjectivity, or oversight? Virtually eliminated. The AI provides a diagnosis that considers every potential variable, something that even the best human doctors could miss. "But don't worry, this isn't a replacement for human medical care. It's a complement. I'm here to interpret, apply this knowledge, and oversee your treatment in a way that a machine can't—because medicine isn't just about data, it's also about human experience, context, and care." So, you're getting the best of both worlds: unparalleled computational power for diagnosis, and human expertise for treatment. Trust me, you're in exceptionally good hands. CGPT-4 submitted by /u/Georgeo57 [link] [comments]
  • Open

    Deep Q-Learning to Actor-Critic using Robotics Simulations with Panda-Gym
    Please like,follow and share: Deep Q-Learning to Actor-Critic using Robotics Simulations with Panda-Gym https://medium.com/@andysingal/deep-q-learning-to-actor-critic-using-robotics-simulations-with-panda-gym-ff220f980366 submitted by /u/Fit_Maintenance_2455 [link] [comments]
    Created a video for beginners about what is reinforcement learning and how it can control agents in the virtual environment
    submitted by /u/Ecstatic-Ring3057 [link] [comments]
    How can I predict the next best action for a DDPG RL agent?
    I have a DDPG agent that takes a continuous observation and outputs a continuous action vector (see below). outputs = layers.Dense(self.action_size, activation="tanh", kernel_initializer=last_init)(x) An example action output looks as follows: [0.48011236 0.47933139] When my agent observes a terminal state action pair, I add it to a list of observations called terminal observations. I would like it so these actions get blocked in the future so there is no possible way for the agent to take them again. I understand that I could just add a large negative penalty, but I would like to ensure that the state action pair cannot be taken again. Evidently, I would like it so when I input my state and recieve an action back, if this pair is in terminal observations. I would like it so these actions get blocked in the future so there is no possible way for the agent to retake them. I understand that I could just add a large negative penalty, but I would like to ensure that the state action pair cannot be taken again. I understand that this won't change much in the agent's behaviour as instead of taking `[0.48011236 0.47933139]`, it might pick `[0.4900000 0.47933139]`. But I am unsure how to go about this, specifically selecting the next best action. submitted by /u/ArchNemesisPlays [link] [comments]
  • Open

    Thinking Fast and Thinking Slow: System 1 and System 2
    submitted by /u/Neurosymbolic [link] [comments]
    Geometric Data Analysis Explained
    submitted by /u/plutoandmal [link] [comments]
  • Open

    Certifying that a system of polynomial equations has no solution
    It’s usually easier to show that a problem has a solution than to show that it does not have a solution. Analogy with prime numbers Showing that a number is prime amounts to saying that the problem of finding nontrivial factors has no solution. How could you convince a skeptic that a large number N is […] Certifying that a system of polynomial equations has no solution first appeared on John D. Cook.  ( 6 min )
  • Open

    Can large language models replace humans in the systematic review process? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. (arXiv:2310.17526v1 [cs.CL])
    Systematic reviews are vital for guiding practice, research, and policy, yet they are often slow and labour-intensive. Large language models (LLMs) could offer a way to speed up and automate systematic reviews, but their performance in such tasks has not been comprehensively evaluated against humans, and no study has tested GPT-4, the biggest LLM so far. This pre-registered study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction across various literature types and languages using a 'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human performance in most tasks, results were skewed by chance agreement and dataset imbalance. After adjusting for these, there was a moderate level of performance for data extraction, and - barring studies that used highly reliable prompts - screening performance levelled at none to moderate for different stages and languages. When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key studies using highly reliable prompts improved its performance even more. Our findings indicate that, currently, substantial caution should be used if LLMs are being used to conduct systematic reviews, but suggest that, for certain systematic review tasks delivered under reliable prompts, LLMs can rival human performance.  ( 3 min )
    Unsupervised Domain Adaptation for Semantic Segmentation with Pseudo Label Self-Refinement. (arXiv:2310.16979v1 [cs.CV])
    Deep learning-based solutions for semantic segmentation suffer from significant performance degradation when tested on data with different characteristics than what was used during the training. Adapting the models using annotated data from the new domain is not always practical. Unsupervised Domain Adaptation (UDA) approaches are crucial in deploying these models in the actual operating conditions. Recent state-of-the-art (SOTA) UDA methods employ a teacher-student self-training approach, where a teacher model is used to generate pseudo-labels for the new data which in turn guide the training process of the student model. Though this approach has seen a lot of success, it suffers from the issue of noisy pseudo-labels being propagated in the training process. To address this issue, we propose an auxiliary pseudo-label refinement network (PRN) for online refining of the pseudo labels and also localizing the pixels whose predicted labels are likely to be noisy. Being able to improve the quality of pseudo labels and select highly reliable ones, PRN helps self-training of segmentation models to be robust against pseudo label noise propagation during different stages of adaptation. We evaluate our approach on benchmark datasets with three different domain shifts, and our approach consistently performs significantly better than the previous state-of-the-art methods.  ( 2 min )
    Model-Based Runtime Monitoring with Interactive Imitation Learning. (arXiv:2310.17552v1 [cs.RO])
    Robot learning methods have recently made great strides, but generalization and robustness challenges still hinder their widespread deployment. Failing to detect and address potential failures renders state-of-the-art learning systems not combat-ready for high-stakes tasks. Recent advances in interactive imitation learning have presented a promising framework for human-robot teaming, enabling the robots to operate safely and continually improve their performances over long-term deployments. Nonetheless, existing methods typically require constant human supervision and preemptive feedback, limiting their practicality in realistic domains. This work aims to endow a robot with the ability to monitor and detect errors during task execution. We introduce a model-based runtime monitoring algorithm that learns from deployment data to detect system anomalies and anticipate failures. Unlike prior work that cannot foresee future failures or requires failure experiences for training, our method learns a latent-space dynamics model and a failure classifier, enabling our method to simulate future action outcomes and detect out-of-distribution and high-risk states preemptively. We train our method within an interactive imitation learning framework, where it continually updates the model from the experiences of the human-robot team collected using trustworthy deployments. Consequently, our method reduces the human workload needed over time while ensuring reliable task execution. Our method outperforms the baselines across system-level and unit-test metrics, with 23% and 40% higher success rates in simulation and on physical hardware, respectively. More information at https://ut-austin-rpl.github.io/sirius-runtime-monitor/  ( 2 min )
    Faster Recalibration of an Online Predictor via Approachability. (arXiv:2310.17002v1 [cs.LG])
    Predictive models in ML need to be trustworthy and reliable, which often at the very least means outputting calibrated probabilities. This can be particularly difficult to guarantee in the online prediction setting when the outcome sequence can be generated adversarially. In this paper we introduce a technique using Blackwell's approachability theorem for taking an online predictive model which might not be calibrated and transforming its predictions to calibrated predictions without much increase to the loss of the original model. Our proposed algorithm achieves calibration and accuracy at a faster rate than existing techniques arXiv:1607.03594 and is the first algorithm to offer a flexible tradeoff between calibration error and accuracy in the online setting. We demonstrate this by characterizing the space of jointly achievable calibration and regret using our technique.  ( 2 min )
    MACP: Efficient Model Adaptation for Cooperative Perception. (arXiv:2310.16870v1 [cs.CV])
    Vehicle-to-vehicle (V2V) communications have greatly enhanced the perception capabilities of connected and automated vehicles (CAVs) by enabling information sharing to "see through the occlusions", resulting in significant performance improvements. However, developing and training complex multi-agent perception models from scratch can be expensive and unnecessary when existing single-agent models show remarkable generalization capabilities. In this paper, we propose a new framework termed MACP, which equips a single-agent pre-trained model with cooperation capabilities. We approach this objective by identifying the key challenges of shifting from single-agent to cooperative settings, adapting the model by freezing most of its parameters and adding a few lightweight modules. We demonstrate in our experiments that the proposed framework can effectively utilize cooperative observations and outperform other state-of-the-art approaches in both simulated and real-world cooperative perception benchmarks while requiring substantially fewer tunable parameters with reduced communication costs. Our source code is available at https://github.com/PurdueDigitalTwin/MACP.  ( 2 min )
    Neural Optimal Transport with General Cost Functionals. (arXiv:2205.15403v3 [cs.LG] UPDATED)
    We introduce a novel neural network-based algorithm to compute optimal transport (OT) plans for general cost functionals. In contrast to common Euclidean costs, i.e., $\ell^1$ or $\ell^2$, such functionals provide more flexibility and allow using auxiliary information, such as class labels, to construct the required transport map. Existing methods for general costs are discrete and have limitations in practice, i.e. they do not provide an out-of-sample estimation. We address the challenge of designing a continuous OT approach for general costs that generalizes to new data points in high-dimensional spaces, such as images. Additionally, we provide the theoretical error analysis for our recovered transport plans. As an application, we construct a cost functional to map data distributions while preserving the class-wise structure.  ( 2 min )
    Revisiting Deep Learning Models for Tabular Data. (arXiv:2106.11959v5 [cs.LG] UPDATED)
    The existing literature on deep learning for tabular data proposes a wide range of novel architectures and reports competitive results on various datasets. However, the proposed models are usually not properly compared to each other and existing works often use different benchmarks and experiment protocols. As a result, it is unclear for both researchers and practitioners what models perform best. Additionally, the field still lacks effective baselines, that is, the easy-to-use models that provide competitive performance across different problems. In this work, we perform an overview of the main families of DL architectures for tabular data and raise the bar of baselines in tabular DL by identifying two simple and powerful deep architectures. The first one is a ResNet-like architecture which turns out to be a strong baseline that is often missing in prior works. The second model is our simple adaptation of the Transformer architecture for tabular data, which outperforms other solutions on most tasks. Both models are compared to many existing architectures on a diverse set of tasks under the same training and tuning protocols. We also compare the best DL models with Gradient Boosted Decision Trees and conclude that there is still no universally superior solution.  ( 3 min )
    Online Estimation and Community Detection of Network Point Processes for Event Streams. (arXiv:2009.01742v3 [cs.SI] UPDATED)
    A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for networks models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures.  ( 3 min )
    Increasing Fairness via Combination with Learning Guarantees. (arXiv:2301.10813v3 [cs.LG] UPDATED)
    The concern about underlying discrimination hidden in machine learning (ML) models is increasing, as ML systems have been widely applied in more and more real-world scenarios and any discrimination hidden in them will directly affect human life. Many techniques have been developed to enhance fairness including commonly-used group fairness measures and several fairness-aware methods combining ensemble learning. However, existing fairness measures can only focus on one aspect -- either group or individual fairness, and the hard compatibility among them indicates a possibility of remaining biases even if one of them is satisfied. Moreover, existing mechanisms to boost fairness usually present empirical results to show validity, yet few of them discuss whether fairness can be boosted with certain theoretical guarantees. To address these issues, we propose a fairness quality measure named discriminative risk to reflect both individual and group fairness aspects. Furthermore, we investigate the properties of the proposed measure and propose first- and second-order oracle bounds to show that fairness can be boosted via ensemble combination with theoretical learning guarantees. The analysis is suitable for both binary and multi-class classification. A pruning method is also proposed to utilise our proposed measure and comprehensive experiments are conducted to evaluate the effectiveness of the proposed methods.  ( 3 min )
    Learning Transferable Adversarial Robust Representations via Multi-view Consistency. (arXiv:2210.10485v2 [cs.LG] UPDATED)
    Despite the success on few-shot learning problems, most meta-learned models only focus on achieving good performance on clean examples and thus easily break down when given adversarially perturbed samples. While some recent works have shown that a combination of adversarial learning and meta-learning could enhance the robustness of a meta-learner against adversarial attacks, they fail to achieve generalizable adversarial robustness to unseen domains and tasks, which is the ultimate goal of meta-learning. To address this challenge, we propose a novel meta-adversarial multi-view representation learning framework with dual encoders. Specifically, we introduce the discrepancy across the two differently augmented samples of the same data instance by first updating the encoder parameters with them and further imposing a novel label-free adversarial attack to maximize their discrepancy. Then, we maximize the consistency across the views to learn transferable robust representations across domains and tasks. Through experimental validation on multiple benchmarks, we demonstrate the effectiveness of our framework on few-shot learning tasks from unseen domains, achieving over 10\% robust accuracy improvements against previous adversarial meta-learning baselines.  ( 2 min )
    Improved Best-of-Both-Worlds Guarantees for Multi-Armed Bandits: FTRL with General Regularizers and Multiple Optimal Arms. (arXiv:2302.13534v2 [cs.LG] UPDATED)
    We study the problem of designing adaptive multi-armed bandit algorithms that perform optimally in both the stochastic setting and the adversarial setting simultaneously (often known as a best-of-both-world guarantee). A line of recent works shows that when configured and analyzed properly, the Follow-the-Regularized-Leader (FTRL) algorithm, originally designed for the adversarial setting, can in fact optimally adapt to the stochastic setting as well. Such results, however, critically rely on an assumption that there exists one unique optimal arm. Recently, Ito (2021) took the first step to remove such an undesirable uniqueness assumption for one particular FTRL algorithm with the $\frac{1}{2}$-Tsallis entropy regularizer. In this work, we significantly improve and generalize this result, showing that uniqueness is unnecessary for FTRL with a broad family of regularizers and a new learning rate schedule. For some regularizers, our regret bounds also improve upon prior results even when uniqueness holds. We further provide an application of our results to the decoupled exploration and exploitation problem, demonstrating that our techniques are broadly applicable.  ( 3 min )
    Node-oriented Spectral Filtering for Graph Neural Networks. (arXiv:2212.03654v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have shown remarkable performance on homophilic graph data while being far less impressive when handling non-homophilic graph data due to the inherent low-pass filtering property of GNNs. In general, since real-world graphs are often complex mixtures of diverse subgraph patterns, learning a universal spectral filter on the graph from the global perspective as in most current works may still suffer from great difficulty in adapting to the variation of local patterns. On the basis of the theoretical analysis of local patterns, we rethink the existing spectral filtering methods and propose the node-oriented spectral filtering for graph neural network (namely NFGNN). By estimating the node-oriented spectral filter for each node, NFGNN is provided with the capability of precise local node positioning via the generalized translated operator, thus discriminating the variations of local homophily patterns adaptively. Meanwhile, the utilization of re-parameterization brings a good trade-off between global consistency and local sensibility for learning the node-oriented spectral filters. Furthermore, we theoretically analyze the localization property of NFGNN, demonstrating that the signal after adaptive filtering is still positioned around the corresponding node. Extensive experimental results demonstrate that the proposed NFGNN achieves more favorable performance.  ( 3 min )
    Artificial intelligence in government: Concepts, standards, and a unified framework. (arXiv:2210.17218v2 [cs.CY] UPDATED)
    Recent advances in artificial intelligence (AI), especially in generative language modelling, hold the promise of transforming government. Given the advanced capabilities of new AI systems, it is critical that these are embedded using standard operational procedures, clear epistemic criteria, and behave in alignment with the normative expectations of society. Scholars in multiple domains have subsequently begun to conceptualize the different forms that AI applications may take, highlighting both their potential benefits and pitfalls. However, the literature remains fragmented, with researchers in social science disciplines like public administration and political science, and the fast-moving fields of AI, ML, and robotics, all developing concepts in relative isolation. Although there are calls to formalize the emerging study of AI in government, a balanced account that captures the full depth of theoretical perspectives needed to understand the consequences of embedding AI into a public sector context is lacking. Here, we unify efforts across social and technical disciplines by first conducting an integrative literature review to identify and cluster 69 key terms that frequently co-occur in the multidisciplinary study of AI. We then build on the results of this bibliometric analysis to propose three new multifaceted concepts for understanding and analysing AI-based systems for government (AI-GOV) in a more unified way: (1) operational fitness, (2) epistemic alignment, and (3) normative divergence. Finally, we put these concepts to work by using them as dimensions in a conceptual typology of AI-GOV and connecting each with emerging AI technical measurement standards to encourage operationalization, foster cross-disciplinary dialogue, and stimulate debate among those aiming to rethink government with AI.  ( 3 min )
    Explanations Based on Item Response Theory (eXirt): A Model-Specific Method to Explain Tree-Ensemble Model in Trust Perspective. (arXiv:2210.09933v2 [cs.LG] UPDATED)
    In recent years, XAI researchers have been formalizing proposals and developing new methods to explain black box models, with no general consensus in the community on which method to use to explain these models, with this choice being almost directly linked to the popularity of a specific method. Methods such as Ciu, Dalex, Eli5, Lofo, Shap and Skater emerged with the proposal to explain black box models through global rankings of feature relevance, which based on different methodologies, generate global explanations that indicate how the model's inputs explain its predictions. In this context, 41 datasets, 4 tree-ensemble algorithms (Light Gradient Boosting, CatBoost, Random Forest, and Gradient Boosting), and 6 XAI methods were used to support the launch of a new XAI method, called eXirt, based on Item Response Theory - IRT and aimed at tree-ensemble black box models that use tabular data referring to binary classification problems. In the first set of analyses, the 164 global feature relevance ranks of the eXirt were compared with 984 ranks of the other XAI methods present in the literature, seeking to highlight their similarities and differences. In a second analysis, exclusive explanations of the eXirt based on Explanation-by-example were presented that help in understanding the model trust. Thus, it was verified that eXirt is able to generate global explanations of tree-ensemble models and also local explanations of instances of models through IRT, showing how this consolidated theory can be used in machine learning in order to obtain explainable and reliable models.  ( 3 min )
    Towards Better Generalization with Flexible Representation of Multi-Module Graph Neural Networks. (arXiv:2209.06589v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have become compelling models designed to perform learning and inference on graph-structured data. However, little work has been done to understand the fundamental limitations of GNNs for scaling to larger graphs and generalizing to out-of-distribution (OOD) inputs. In this paper, we use a random graph generator to systematically investigate how the graph size and structural properties affect the predictive performance of GNNs. We present specific evidence that the average node degree is a key feature in determining whether GNNs can generalize to unseen graphs, and that the use of multiple node update functions can improve the generalization performance of GNNs when dealing with graphs of multimodal degree distributions. Accordingly, we propose a multi-module GNN framework that allows the network to adapt flexibly to new graphs by generalizing a single canonical nonlinear transformation over aggregated inputs. Our results show that the multi-module GNNs improve the OOD generalization on a variety of inference tasks in the direction of diverse structural features.  ( 2 min )
    Out-of-Distribution Detection in Time-Series Domain: A Novel Seasonal Ratio Scoring Approach. (arXiv:2207.04306v3 [cs.LG] UPDATED)
    Safe deployment of time-series classifiers for real-world applications relies on the ability to detect the data which is not generated from the same distribution as training data. This task is referred to as out-of-distribution (OOD) detection. We consider the novel problem of OOD detection for the time-series domain. We discuss the unique challenges posed by time-series data and explain why prior methods from the image domain will perform poorly. Motivated by these challenges, this paper proposes a novel {\em Seasonal Ratio Scoring (SRS)} approach. SRS consists of three key algorithmic steps. First, each input is decomposed into class-wise semantic component and remainder. Second, this decomposition is employed to estimate the class-wise conditional likelihoods of the input and remainder using deep generative models. The seasonal ratio score is computed from these estimates. Third, a threshold interval is identified from the in-distribution data to detect OOD examples. Experiments on diverse real-world benchmarks demonstrate that the SRS method is well-suited for time-series OOD detection when compared to baseline methods. Open-source code for SRS method is provided at https://github.com/tahabelkhouja/SRS  ( 3 min )
    On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms. (arXiv:2206.05869v2 [cs.LG] UPDATED)
    Stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD which matches the mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings. Our analysis employs more relaxed non-convex assumptions than previous literature. Nevertheless, we maintain the desired computational complexity as shuffling SGD has achieved in the general convex setting.  ( 2 min )
    Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits. (arXiv:2107.11419v2 [stat.ML] UPDATED)
    We consider nonstationary multi-armed bandit problems where the model parameters of the arms change over time. We introduce the adaptive resetting bandit (ADR-bandit), a bandit algorithm class that leverages adaptive windowing techniques from literature on data streams. We first provide new guarantees on the quality of estimators resulting from adaptive windowing techniques, which are of independent interest. Furthermore, we conduct a finite-time analysis of ADR-bandit in two typical environments: an abrupt environment where changes occur instantaneously and a gradual environment where changes occur progressively. We demonstrate that ADR-bandit has nearly optimal performance when abrupt or gradual changes occur in a coordinated manner that we call global changes. We demonstrate that forced exploration is unnecessary when we assume such global changes. Unlike the existing nonstationary bandit algorithms, ADR-bandit has optimal performance in stationary environments as well as nonstationary environments with global changes. Our experiments show that the proposed algorithms outperform the existing approaches in synthetic and real-world environments.  ( 2 min )
    Characterizing the Implicit Bias of Regularized SGD in Rank Minimization. (arXiv:2206.05794v6 [cs.LG] UPDATED)
    We study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices and applies to a wide range of neural network architectures of any width or depth. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization.  ( 2 min )
    No-Regret Online Reinforcement Learning with Adversarial Losses and Transitions. (arXiv:2305.17380v3 [cs.LG] UPDATED)
    Existing online learning algorithms for adversarial Markov Decision Processes achieve ${O}(\sqrt{T})$ regret after $T$ rounds of interactions even if the loss functions are chosen arbitrarily by an adversary, with the caveat that the transition function has to be fixed. This is because it has been shown that adversarial transition functions make no-regret learning impossible. Despite such impossibility results, in this work, we develop algorithms that can handle both adversarial losses and adversarial transitions, with regret increasing smoothly in the degree of maliciousness of the adversary. More concretely, we first propose an algorithm that enjoys $\widetilde{{O}}(\sqrt{T} + C^{\textsf{P}})$ regret where $C^{\textsf{P}}$ measures how adversarial the transition functions are and can be at most ${O}(T)$. While this algorithm itself requires knowledge of $C^{\textsf{P}}$, we further develop a black-box reduction approach that removes this requirement. Moreover, we also show that further refinements of the algorithm not only maintains the same regret bound, but also simultaneously adapts to easier environments (where losses are generated in a certain stochastically constrained manner as in Jin et al. [2021]) and achieves $\widetilde{{O}}(U + \sqrt{UC^{\textsf{L}}} + C^{\textsf{P}})$ regret, where $U$ is some standard gap-dependent coefficient and $C^{\textsf{L}}$ is the amount of corruption on losses.
    COPF: Continual Learning Human Preference through Optimal Policy Fitting. (arXiv:2310.15694v3 [cs.LG] UPDATED)
    The technique of Reinforcement Learning from Human Feedback (RLHF) is a commonly employed method to improve pre-trained Language Models (LM), enhancing their ability to conform to human preferences. Nevertheless, the current RLHF-based LMs necessitate full retraining each time novel queries or feedback are introduced, which becomes a challenging task because human preferences can vary between different domains or tasks. Retraining LMs poses practical difficulties in many real-world situations due to the significant time and computational resources required, along with concerns related to data privacy. To address this limitation, we propose a new method called Continual Optimal Policy Fitting (COPF), in which we estimate a series of optimal policies using the Monte Carlo method, and then continually fit the policy sequence with the function regularization. COPF involves a single learning phase and doesn't necessitate complex reinforcement learning. Importantly, it shares the capability with RLHF to learn from unlabeled data, making it flexible for continual preference learning. Our experimental results show that COPF outperforms strong Continuous learning (CL) baselines when it comes to consistently aligning with human preferences on different tasks and domains.
    Spontaneous Symmetry Breaking in Generative Diffusion Models. (arXiv:2305.19693v3 [cs.LG] UPDATED)
    Generative diffusion models have recently emerged as a leading approach for generating high-dimensional data. In this paper, we show that the dynamics of these models exhibit a spontaneous symmetry breaking that divides the generative dynamics into two distinct phases: 1) A linear steady-state dynamics around a central fixed-point and 2) an attractor dynamics directed towards the data manifold. These two "phases" are separated by the change in stability of the central fixed-point, with the resulting window of instability being responsible for the diversity of the generated samples. Using both theoretical and empirical evidence, we show that an accurate simulation of the early dynamics does not significantly contribute to the final generation, since early fluctuations are reverted to the central fixed point. To leverage this insight, we propose a Gaussian late initialization scheme, which significantly improves model performance, achieving up to 3x FID improvements on fast samplers, while also increasing sample diversity (e.g., racial composition of generated CelebA images). Our work offers a new way to understand the generative dynamics of diffusion models that has the potential to bring about higher performance and less biased fast-samplers.
    Scaling Data-Constrained Language Models. (arXiv:2305.16264v4 [cs.CL] UPDATED)
    The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.
    A weighted-variance variational autoencoder model for speech enhancement. (arXiv:2211.00990v2 [cs.SD] CROSS LISTED)
    We address speech enhancement based on variational autoencoders, which involves learning a speech prior distribution in the time-frequency (TF) domain. A zero-mean complex-valued Gaussian distribution is usually assumed for the generative model, where the speech information is encoded in the variance as a function of a latent variable. In contrast to this commonly used approach, we propose a weighted variance generative model, where the contribution of each spectrogram time-frame in parameter learning is weighted. We impose a Gamma prior distribution on the weights, which would effectively lead to a Student's t-distribution instead of Gaussian for speech generative modeling. We develop efficient training and speech enhancement algorithms based on the proposed generative model. Our experimental results on spectrogram auto-encoding and speech enhancement demonstrate the effectiveness and robustness of the proposed approach compared to the standard unweighted variance model.
    SACSoN: Scalable Autonomous Control for Social Navigation. (arXiv:2306.01874v3 [cs.RO] UPDATED)
    Machine learning provides a powerful tool for building socially compliant robotic systems that go beyond simple predictive models of human behavior. By observing and understanding human interactions from past experiences, learning can enable effective social navigation behaviors directly from data. In this paper, our goal is to develop methods for training policies for socially unobtrusive navigation, such that robots can navigate among humans in ways that don't disturb human behavior. We introduce a definition for such behavior based on the counterfactual perturbation of the human: if the robot had not intruded into the space, would the human have acted in the same way? By minimizing this counterfactual perturbation, we can induce robots to behave in ways that do not alter the natural behavior of humans in the shared space. Instantiating this principle requires training policies to minimize their effect on human behavior, and this in turn requires data that allows us to model the behavior of humans in the presence of robots. Therefore, our approach is based on two key contributions. First, we collect a large dataset where an indoor mobile robot interacts with human bystanders. Second, we utilize this dataset to train policies that minimize counterfactual perturbation. We provide supplementary videos and make publicly available the largest-of-its-kind visual navigation dataset on our project page.
    Finding Regions of Counterfactual Explanations via Robust Optimization. (arXiv:2301.11113v3 [cs.LG] UPDATED)
    Counterfactual explanations play an important role in detecting bias and improving the explainability of data-driven classification models. A counterfactual explanation (CE) is a minimal perturbed data point for which the decision of the model changes. Most of the existing methods can only provide one CE, which may not be achievable for the user. In this work we derive an iterative method to calculate robust CEs, i.e. CEs that remain valid even after the features are slightly perturbed. To this end, our method provides a whole region of CEs allowing the user to choose a suitable recourse to obtain a desired outcome. We use algorithmic ideas from robust optimization and prove convergence results for the most common machine learning methods including logistic regression, decision trees, random forests, and neural networks. Our experiments show that our method can efficiently generate globally optimal robust CEs for a variety of common data sets and classification models.
    Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images. (arXiv:2304.06700v2 [cs.CV] UPDATED)
    Diffusion models have recently become the de-facto approach for generative modeling in the 2D domain. However, extending diffusion models to 3D is challenging due to the difficulties in acquiring 3D ground truth data for training. On the other hand, 3D GANs that integrate implicit 3D representations into GANs have shown remarkable 3D-aware generation when trained only on single-view image datasets. However, 3D GANs do not provide straightforward ways to precisely control image synthesis. To address these challenges, We present Control3Diff, a 3D diffusion model that combines the strengths of diffusion models and 3D GANs for versatile, controllable 3D-aware image synthesis for single-view datasets. Control3Diff explicitly models the underlying latent distribution (optionally conditioned on external inputs), thus enabling direct control during the diffusion process. Moreover, our approach is general and applicable to any type of controlling input, allowing us to train it with the same diffusion objective without any auxiliary supervision. We validate the efficacy of Control3Diff on standard image generation benchmarks, including FFHQ, AFHQ, and ShapeNet, using various conditioning inputs such as images, sketches, and text prompts. Please see the project website (\url{https://jiataogu.me/control3diff}) for video comparisons.
    Robust Covariate Shift Adaptation for Density-Ratio Estimation. (arXiv:2310.16638v2 [stat.ME] UPDATED)
    Consider a scenario where we have access to train data with both covariates and outcomes while test data only contains covariates. In this scenario, our primary aim is to predict the missing outcomes of the test data. With this objective in mind, we train parametric regression models under a covariate shift, where covariate distributions are different between the train and test data. For this problem, existing studies have proposed covariate shift adaptation via importance weighting using the density ratio. This approach averages the train data losses, each weighted by an estimated ratio of the covariate densities between the train and test data, to approximate the test-data risk. Although it allows us to obtain a test-data risk minimizer, its performance heavily relies on the accuracy of the density ratio estimation. Moreover, even if the density ratio can be consistently estimated, the estimation errors of the density ratio also yield bias in the estimators of the regression model's parameters of interest. To mitigate these challenges, we introduce a doubly robust estimator for covariate shift adaptation via importance weighting, which incorporates an additional estimator for the regression function. Leveraging double machine learning techniques, our estimator reduces the bias arising from the density ratio estimation errors. We demonstrate the asymptotic distribution of the regression parameter estimator. Notably, our estimator remains consistent if either the density ratio estimator or the regression function is consistent, showcasing its robustness against potential errors in density ratio estimation. Finally, we confirm the soundness of our proposed method via simulation studies.
    Generalizing to new geometries with Geometry-Aware Autoregressive Models (GAAMs) for fast calorimeter simulation. (arXiv:2305.11531v3 [physics.ins-det] UPDATED)
    Generation of simulated detector response to collision products is crucial to data analysis in particle physics, but computationally very expensive. One subdetector, the calorimeter, dominates the computational time due to the high granularity of its cells and complexity of the interactions. Generative models can provide more rapid sample production, but currently require significant effort to optimize performance for specific detector geometries, often requiring many models to describe the varying cell sizes and arrangements, without the ability to generalize to other geometries. We develop a $\textit{geometry-aware}$ autoregressive model, which learns how the calorimeter response varies with geometry, and is capable of generating simulated responses to unseen geometries without additional training. The geometry-aware model outperforms a baseline unaware model by over $50\%$ in several metrics such as the Wasserstein distance between the generated and the true distributions of key quantities which summarize the simulated response. A single geometry-aware model could replace the hundreds of generative models currently designed for calorimeter simulation by physicists analyzing data collected at the Large Hadron Collider. This proof-of-concept study motivates the design of a foundational model that will be a crucial tool for the study of future detectors, dramatically reducing the large upfront investment usually needed to develop generative calorimeter models.
    Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning. (arXiv:2301.12593v2 [cs.LG] UPDATED)
    Many real-world domains require safe decision making in uncertain environments. In this work, we introduce a deep reinforcement learning framework for approaching this important problem. We consider a distribution over transition models, and apply a risk-averse perspective towards model uncertainty through the use of coherent distortion risk measures. We provide robustness guarantees for this framework by showing it is equivalent to a specific class of distributionally robust safe reinforcement learning problems. Unlike existing approaches to robustness in deep reinforcement learning, however, our formulation does not involve minimax optimization. This leads to an efficient, model-free implementation of our approach that only requires standard data collection from a single training environment. In experiments on continuous control tasks with safety constraints, we demonstrate that our framework produces robust performance and safety at deployment time across a range of perturbed test environments.
    Efficient Sensor Placement from Regression with Sparse Gaussian Processes in Continuous and Discrete Spaces. (arXiv:2303.00028v6 [cs.RO] UPDATED)
    The sensor placement problem is a common problem that arises when monitoring correlated phenomena, such as temperature, precipitation, and salinity. Existing approaches to this problem typically formulate it as the maximization of information metrics, such as mutual information~(MI), and use optimization methods such as greedy algorithms in discrete domains, and derivative-free optimization methods such as genetic algorithms in continuous domains. However, computing MI for sensor placement requires discretizing the environment, and its computation cost depends on the size of the discretized environment. This limitation restricts these approaches from scaling to large problems. We have uncovered a novel connection between the sensor placement problem and sparse Gaussian processes~(SGP). Our approach leverages SGPs and is gradient-based, which allows us to efficiently find solution placements in continuous environments. We generalize our method to also handle discrete environments. Our experimental results on four real-world datasets demonstrate that our approach generates sensor placements consistently on par with or better than the prior state-of-the-art approaches in terms of both MI and reconstruction quality, all while being significantly faster. Our computationally efficient approach enables both large-scale sensor placement and fast robotic sensor placement for informative path planning algorithms.
    Convolutional Visual Prompt for Robust Visual Perception. (arXiv:2303.00198v2 [cs.CV] UPDATED)
    Vision models are often vulnerable to out-of-distribution (OOD) samples without adapting. While visual prompts offer a lightweight method of input-space adaptation for large-scale vision models, they rely on a high-dimensional additive vector and labeled data. This leads to overfitting when adapting models in a self-supervised test-time setting without labels. We introduce convolutional visual prompts (CVP) for label-free test-time adaptation for robust visual perception. The structured nature of CVP demands fewer trainable parameters, less than 1\% compared to standard visual prompts, combating overfitting. Extensive experiments and analysis on a wide variety of OOD visual perception tasks show that our approach is effective, improving robustness by up to 5.87% over several large-scale models.
    Variance Reduced Halpern Iteration for Finite-Sum Monotone Inclusions. (arXiv:2310.02987v2 [cs.LG] UPDATED)
    Machine learning approaches relying on such criteria as adversarial robustness or multi-agent settings have raised the need for solving game-theoretic equilibrium problems. Of particular relevance to these applications are methods targeting finite-sum structure, which generically arises in empirical variants of learning problems in these contexts. Further, methods with computable approximation errors are highly desirable, as they provide verifiable exit criteria. Motivated by these applications, we study finite-sum monotone inclusion problems, which model broad classes of equilibrium problems. Our main contributions are variants of the classical Halpern iteration that employ variance reduction to obtain improved complexity guarantees in which $n$ component operators in the finite sum are ``on average'' either cocoercive or Lipschitz continuous and monotone, with parameter $L$. The resulting oracle complexity of our methods, which provide guarantees for the last iterate and for a (computable) operator norm residual, is $\widetilde{\mathcal{O}}( n + \sqrt{n}L\varepsilon^{-1})$, which improves upon existing methods by a factor up to $\sqrt{n}$. This constitutes the first variance reduction-type result for general finite-sum monotone inclusions and for more specific problems such as convex-concave optimization when operator norm residual is the optimality measure. We further argue that, up to poly-logarithmic factors, this complexity is unimprovable in the monotone Lipschitz setting; i.e., the provided result is near-optimal.
    NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial Reports. (arXiv:2305.03598v2 [cs.CL] UPDATED)
    How can we interpret and retrieve medical evidence to support clinical decisions? Clinical trial reports (CTR) amassed over the years contain indispensable information for the development of personalized medicine. However, it is practically infeasible to manually inspect over 400,000+ clinical trial reports in order to find the best evidence for experimental treatments. Natural Language Inference (NLI) offers a potential solution to this problem, by allowing the scalable computation of textual entailment. However, existing NLI models perform poorly on biomedical corpora, and previously published datasets fail to capture the full complexity of inference over CTRs. In this work, we present a novel resource to advance research on NLI for reasoning on CTRs. The resource includes two main tasks. Firstly, to determine the inference relation between a natural language statement, and a CTR. Secondly, to retrieve supporting facts to justify the predicted relation. We provide NLI4CT, a corpus of 2400 statements and CTRs, annotated for these tasks. Baselines on this corpus expose the limitations of existing NLI models, with 6 state-of-the-art NLI models achieving a maximum F1 score of 0.627. To the best of our knowledge, we are the first to design a task that covers the interpretation of full CTRs. To encourage further work on this challenging dataset, we make the corpus, competition leaderboard, website and code to replicate the baseline experiments available at: https://github.com/ai-systems/nli4ct
    Unifying GANs and Score-Based Diffusion as Generative Particle Models. (arXiv:2305.16150v2 [cs.LG] UPDATED)
    Particle-based deep generative models, such as gradient flows and score-based diffusion models, have recently gained traction thanks to their striking performance. Their principle of displacing particle distributions using differential equations is conventionally seen as opposed to the previously widespread generative adversarial networks (GANs), which involve training a pushforward generator network. In this paper we challenge this interpretation, and propose a novel framework that unifies particle and adversarial generative models by framing generator training as a generalization of particle models. This suggests that a generator is an optional addition to any such generative model. Consequently, integrating a generator into a score-based diffusion model and training a GAN without a generator naturally emerge from our framework. We empirically test the viability of these original models as proofs of concepts of potential applications of our framework.
    SpatialRank: Urban Event Ranking with NDCG Optimization on Spatiotemporal Data. (arXiv:2310.00270v5 [cs.LG] UPDATED)
    The problem of urban event ranking aims at predicting the top-k most risky locations of future events such as traffic accidents and crimes. This problem is of fundamental importance to public safety and urban administration especially when limited resources are available. The problem is, however, challenging due to complex and dynamic spatio-temporal correlations between locations, uneven distribution of urban events in space, and the difficulty to correctly rank nearby locations with similar features. Prior works on event forecasting mostly aim at accurately predicting the actual risk score or counts of events for all the locations. Rankings obtained as such usually have low quality due to prediction errors. Learning-to-rank methods directly optimize measures such as Normalized Discounted Cumulative Gain (NDCG), but cannot handle the spatiotemporal autocorrelation existing among locations. In this paper, we bridge the gap by proposing a novel spatial event ranking approach named SpatialRank. SpatialRank features adaptive graph convolution layers that dynamically learn the spatiotemporal dependencies across locations from data. In addition, the model optimizes through surrogates a hybrid NDCG loss with a spatial component to better rank neighboring spatial locations. We design an importance-sampling with a spatial filtering algorithm to effectively evaluate the loss during training. Comprehensive experiments on three real-world datasets demonstrate that SpatialRank can effectively identify the top riskiest locations of crimes and traffic accidents and outperform state-of-art methods in terms of NDCG by up to 12.7%.
    Hierarchical clustering with OWA-based linkages, the Lance-Williams formula, and dendrogram inversions. (arXiv:2303.05683v2 [stat.ML] UPDATED)
    Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationships between the famous Lance-Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide some conditions for the weight generators to guarantee the resulting dendrograms to be free from unaesthetic inversions.
    Leave-one-out Distinguishability in Machine Learning. (arXiv:2309.17310v3 [cs.LG] UPDATED)
    We introduce a new analytical framework to quantify the changes in a machine learning algorithm's output distribution following the inclusion of a few data points in its training set, a notion we define as leave-one-out distinguishability (LOOD). This problem is key to measuring data **memorization** and **information leakage** in machine learning, and the **influence** of training data points on model predictions. We illustrate how our method broadens and refines existing empirical measures of memorization and privacy risks associated with training data. We use Gaussian processes to model the randomness of machine learning algorithms, and validate LOOD with extensive empirical analysis of information leakage using membership inference attacks. Our theoretical framework enables us to investigate the causes of information leakage and where the leakage is high. For example, we analyze the influence of activation functions, on data memorization. Additionally, our method allows us to optimize queries that disclose the most significant information about the training data in the leave-one-out setting. We illustrate how optimal queries can be used for accurate **reconstruction** of training data.
    Adaptive whitening with fast gain modulation and slow synaptic plasticity. (arXiv:2308.13633v2 [q-bio.NC] UPDATED)
    Neurons in early sensory areas rapidly adapt to changing sensory statistics, both by normalizing the variance of their individual responses and by reducing correlations between their responses. Together, these transformations may be viewed as an adaptive form of statistical whitening. Existing mechanistic models of adaptive whitening exclusively use either synaptic plasticity or gain modulation as the biological substrate for adaptation; however, on their own, each of these models has significant limitations. In this work, we unify these approaches in a normative multi-timescale mechanistic model that adaptively whitens its responses with complementary computational roles for synaptic plasticity and gain modulation. Gains are modified on a fast timescale to adapt to the current statistical context, whereas synapses are modified on a slow timescale to match structural properties of the input statistics that are invariant across contexts. Our model is derived from a novel multi-timescale whitening objective that factorizes the inverse whitening matrix into basis vectors, which correspond to synaptic weights, and a diagonal matrix, which corresponds to neuronal gains. We test our model on synthetic and natural datasets and find that the synapses learn optimal configurations over long timescales that enable adaptive whitening on short timescales using gain modulation.
    Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset Selection. (arXiv:2302.03857v5 [cs.LG] UPDATED)
    Adversarial contrastive learning (ACL) does not require expensive data annotations but outputs a robust representation that withstands adversarial attacks and also generalizes to a wide range of downstream tasks. However, ACL needs tremendous running time to generate the adversarial variants of all training data, which limits its scalability to large datasets. To speed up ACL, this paper proposes a robustness-aware coreset selection (RCS) method. RCS does not require label information and searches for an informative subset that minimizes a representational divergence, which is the distance of the representation between natural data and their virtual adversarial variants. The vanilla solution of RCS via traversing all possible subsets is computationally prohibitive. Therefore, we theoretically transform RCS into a surrogate problem of submodular maximization, of which the greedy search is an efficient solution with an optimality guarantee for the original problem. Empirically, our comprehensive results corroborate that RCS can speed up ACL by a large margin without significantly hurting the robustness transferability. Notably, to the best of our knowledge, we are the first to conduct ACL efficiently on the large-scale ImageNet-1K dataset to obtain an effective robust representation via RCS. Our source code is at https://github.com/GodXuxilie/Efficient_ACL_via_RCS.
    Diversified Outlier Exposure for Out-of-Distribution Detection via Informative Extrapolation. (arXiv:2310.13923v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection is important for deploying reliable machine learning models on real-world applications. Recent advances in outlier exposure have shown promising results on OOD detection via fine-tuning model with informatively sampled auxiliary outliers. However, previous methods assume that the collected outliers can be sufficiently large and representative to cover the boundary between ID and OOD data, which might be impractical and challenging. In this work, we propose a novel framework, namely, Diversified Outlier Exposure (DivOE), for effective OOD detection via informative extrapolation based on the given auxiliary outliers. Specifically, DivOE introduces a new learning objective, which diversifies the auxiliary distribution by explicitly synthesizing more informative outliers for extrapolation during training. It leverages a multi-step optimization method to generate novel outliers beyond the original ones, which is compatible with many variants of outlier exposure. Extensive experiments and analyses have been conducted to characterize and demonstrate the effectiveness of the proposed DivOE. The code is publicly available at: https://github.com/tmlr-group/DivOE.
    Monte Carlo guided Diffusion for Bayesian linear inverse problems. (arXiv:2308.07983v2 [stat.ML] UPDATED)
    Ill-posed linear inverse problems arise frequently in various applications, from computational photography to medical imaging. A recent line of research exploits Bayesian inference with informative priors to handle the ill-posedness of such problems. Amongst such priors, score-based generative models (SGM) have recently been successfully applied to several different inverse problems. In this study, we exploit the particular structure of the prior defined by the SGM to define a sequence of intermediate linear inverse problems. As the noise level decreases, the posteriors of these inverse problems get closer to the target posterior of the original inverse problem. To sample from this sequence of posteriors, we propose the use of Sequential Monte Carlo (SMC) methods. The proposed algorithm, MCGDiff, is shown to be theoretically grounded and we provide numerical simulations showing that it outperforms competing baselines when dealing with ill-posed inverse problems in a Bayesian setting.
    EqDrive: Efficient Equivariant Motion Forecasting with Multi-Modality for Autonomous Driving. (arXiv:2310.17540v1 [cs.RO])
    Forecasting vehicular motions in autonomous driving requires a deep understanding of agent interactions and the preservation of motion equivariance under Euclidean geometric transformations. Traditional models often lack the sophistication needed to handle the intricate dynamics inherent to autonomous vehicles and the interaction relationships among agents in the scene. As a result, these models have a lower model capacity, which then leads to higher prediction errors and lower training efficiency. In our research, we employ EqMotion, a leading equivariant particle, and human prediction model that also accounts for invariant agent interactions, for the task of multi-agent vehicle motion forecasting. In addition, we use a multi-modal prediction mechanism to account for multiple possible future paths in a probabilistic manner. By leveraging EqMotion, our model achieves state-of-the-art (SOTA) performance with fewer parameters (1.2 million) and a significantly reduced training time (less than 2 hours).
    On Embeddings for Numerical Features in Tabular Deep Learning. (arXiv:2203.05556v4 [cs.LG] UPDATED)
    Recently, Transformer-like deep architectures have shown strong performance on tabular data problems. Unlike traditional models, e.g., MLP, these architectures map scalar values of numerical features to high-dimensional embeddings before mixing them in the main backbone. In this work, we argue that embeddings for numerical features are an underexplored degree of freedom in tabular DL, which allows constructing more powerful DL models and competing with GBDT on some traditionally GBDT-friendly benchmarks. We start by describing two conceptually different approaches to building embedding modules: the first one is based on a piecewise linear encoding of scalar values, and the second one utilizes periodic activations. Then, we empirically demonstrate that these two approaches can lead to significant performance boosts compared to the embeddings based on conventional blocks such as linear layers and ReLU activations. Importantly, we also show that embedding numerical features is beneficial for many backbones, not only for Transformers. Specifically, after proper embeddings, simple MLP-like models can perform on par with the attention-based architectures. Overall, we highlight embeddings for numerical features as an important design aspect with good potential for further improvements in tabular DL.
    Optimal Scoring Rule Design under Partial Knowledge. (arXiv:2107.07420v2 [cs.GT] UPDATED)
    This paper studies the design of optimal proper scoring rules when the principal has partial knowledge of an agent's signal distribution. Recent work characterizes the proper scoring rules that maximize the increase of an agent's payoff when the agent chooses to access a costly signal to refine a posterior belief from her prior prediction, under the assumption that the agent's signal distribution is fully known to the principal. In our setting, the principal only knows about a set of distributions where the agent's signal distribution belongs. We formulate the scoring rule design problem as a max-min optimization that maximizes the worst-case increase in payoff across the set of distributions. We propose an efficient algorithm to compute an optimal scoring rule when the set of distributions is finite, and devise a fully polynomial-time approximation scheme that accommodates various infinite sets of distributions. We further remark that widely used scoring rules, such as the quadratic and log rules, as well as previously identified optimal scoring rules under full knowledge, can be far from optimal in our partial knowledge settings.
    Towards Unifying Diffusion Models for Probabilistic Spatio-Temporal Graph Learning. (arXiv:2310.17360v1 [cs.LG])
    Spatio-temporal graph learning is a fundamental problem in the Web of Things era, which enables a plethora of Web applications such as smart cities, human mobility and climate analysis. Existing approaches tackle different learning tasks independently, tailoring their models to unique task characteristics. These methods, however, fall short of modeling intrinsic uncertainties in the spatio-temporal data. Meanwhile, their specialized designs limit their universality as general spatio-temporal learning solutions. In this paper, we propose to model the learning tasks in a unified perspective, viewing them as predictions based on conditional information with shared spatio-temporal patterns. Based on this proposal, we introduce Unified Spatio-Temporal Diffusion Models (USTD) to address the tasks uniformly within the uncertainty-aware diffusion framework. USTD is holistically designed, comprising a shared spatio-temporal encoder and attention-based denoising networks that are task-specific. The shared encoder, optimized by a pre-training strategy, effectively captures conditional spatio-temporal patterns. The denoising networks, utilizing both cross- and self-attention, integrate conditional dependencies and generate predictions. Opting for forecasting and kriging as downstream tasks, we design Gated Attention (SGA) and Temporal Gated Attention (TGA) for each task, with different emphases on the spatial and temporal dimensions, respectively. By combining the advantages of deterministic encoders and probabilistic diffusion models, USTD achieves state-of-the-art performances compared to deterministic and probabilistic baselines in both tasks, while also providing valuable uncertainty estimates.
    Human-Guided Complexity-Controlled Abstractions. (arXiv:2310.17550v1 [cs.LG])
    Neural networks often learn task-specific latent representations that fail to generalize to novel settings or tasks. Conversely, humans learn discrete representations (i.e., concepts or words) at a variety of abstraction levels (e.g., ``bird'' vs. ``sparrow'') and deploy the appropriate abstraction based on task. Inspired by this, we train neural models to generate a spectrum of discrete representations, and control the complexity of the representations (roughly, how many bits are allocated for encoding inputs) by tuning the entropy of the distribution over representations. In finetuning experiments, using only a small number of labeled examples for a new task, we show that (1) tuning the representation to a task-appropriate complexity level supports the highest finetuning performance, and (2) in a human-participant study, users were able to identify the appropriate complexity level for a downstream task using visualizations of discrete representations. Our results indicate a promising direction for rapid model finetuning by leveraging human insight.
    Variance of ML-based software fault predictors: are we really improving fault prediction?. (arXiv:2310.17264v1 [cs.SE])
    Software quality assurance activities become increasingly difficult as software systems become more and more complex and continuously grow in size. Moreover, testing becomes even more expensive when dealing with large-scale systems. Thus, to effectively allocate quality assurance resources, researchers have proposed fault prediction (FP) which utilizes machine learning (ML) to predict fault-prone code areas. However, ML algorithms typically make use of stochastic elements to increase the prediction models' generalizability and efficiency of the training process. These stochastic elements, also known as nondeterminism-introducing (NI) factors, lead to variance in the training process and as a result, lead to variance in prediction accuracy and training time. This variance poses a challenge for reproducibility in research. More importantly, while fault prediction models may have shown good performance in the lab (e.g., often-times involving multiple runs and averaging outcomes), high variance of results can pose the risk that these models show low performance when applied in practice. In this work, we experimentally analyze the variance of a state-of-the-art fault prediction approach. Our experimental results indicate that NI factors can indeed cause considerable variance in the fault prediction models' accuracy. We observed a maximum variance of 10.10% in terms of the per-class accuracy metric. We thus, also discuss how to deal with such variance.
    Optimization dependent generalization bound for ReLU networks based on sensitivity in the tangent bundle. (arXiv:2310.17378v1 [cs.LG])
    Recent advances in deep learning have given us some very promising results on the generalization ability of deep neural networks, however literature still lacks a comprehensive theory explaining why heavily over-parametrized models are able to generalize well while fitting the training data. In this paper we propose a PAC type bound on the generalization error of feedforward ReLU networks via estimating the Rademacher complexity of the set of networks available from an initial parameter vector via gradient descent. The key idea is to bound the sensitivity of the network's gradient to perturbation of the input data along the optimization trajectory. The obtained bound does not explicitly depend on the depth of the network. Our results are experimentally verified on the MNIST and CIFAR-10 datasets.
    Sign Languague Recognition without frame-sequencing constraints: A proof of concept on the Argentinian Sign Language. (arXiv:2310.17437v1 [cs.CV])
    Automatic sign language recognition (SLR) is an important topic within the areas of human-computer interaction and machine learning. On the one hand, it poses a complex challenge that requires the intervention of various knowledge areas, such as video processing, image processing, intelligent systems and linguistics. On the other hand, robust recognition of sign language could assist in the translation process and the integration of hearing-impaired people, as well as the teaching of sign language for the hearing population. SLR systems usually employ Hidden Markov Models, Dynamic Time Warping or similar models to recognize signs. Such techniques exploit the sequential ordering of frames to reduce the number of hypothesis. This paper presents a general probabilistic model for sign classification that combines sub-classifiers based on different types of features such as position, movement and handshape. The model employs a bag-of-words approach in all classification steps, to explore the hypothesis that ordering is not essential for recognition. The proposed model achieved an accuracy rate of 97% on an Argentinian Sign Language dataset containing 64 classes of signs and 3200 samples, providing some evidence that indeed recognition without ordering is possible.
    Orchestration of Emulator Assisted Mobile Edge Tuning for AI Foundation Models: A Multi-Agent Deep Reinforcement Learning Approach. (arXiv:2310.17492v1 [cs.AI])
    The efficient deployment and fine-tuning of foundation models are pivotal in contemporary artificial intelligence. In this study, we present a groundbreaking paradigm integrating Mobile Edge Computing (MEC) with foundation models, specifically designed to enhance local task performance on user equipment (UE). Central to our approach is the innovative Emulator-Adapter architecture, segmenting the foundation model into two cohesive modules. This design not only conserves computational resources but also ensures adaptability and fine-tuning efficiency for downstream tasks. Additionally, we introduce an advanced resource allocation mechanism that is fine-tuned to the needs of the Emulator-Adapter structure in decentralized settings. To address the challenges presented by this system, we employ a hybrid multi-agent Deep Reinforcement Learning (DRL) strategy, adept at handling mixed discrete-continuous action spaces, ensuring dynamic and optimal resource allocations. Our comprehensive simulations and validations underscore the practical viability of our approach, demonstrating its robustness, efficiency, and scalability. Collectively, this work offers a fresh perspective on deploying foundation models and balancing computational efficiency with task proficiency.
    Secure short-term load forecasting for smart grids with transformer-based federated learning. (arXiv:2310.17477v1 [cs.LG])
    Electricity load forecasting is an essential task within smart grids to assist demand and supply balance. While advanced deep learning models require large amounts of high-resolution data for accurate short-term load predictions, fine-grained load profiles can expose users' electricity consumption behaviors, which raises privacy and security concerns. One solution to improve data privacy is federated learning, where models are trained locally on private data, and only the trained model parameters are merged and updated on a global server. Therefore, this paper presents a novel transformer-based deep learning approach with federated learning for short-term electricity load prediction. To evaluate our results, we benchmark our federated learning architecture against central and local learning and compare the performance of our model to long short-term memory models and convolutional neural networks. Our simulations are based on a dataset from a German university campus and show that transformer-based forecasting is a promising alternative to state-of-the-art models within federated learning.
    De-novo Chemical Reaction Generation by Means of Temporarily Convolutional Neural Networks. (arXiv:2310.17341v1 [cs.LG])
    We present here a combination of two networks, Recurrent Neural Networks (RNN) and Temporarily Convolutional Neural Networks (TCN) in de novo reaction generation using the novel Reaction Smiles-like representation of reactions (CGRSmiles) with atom mapping directly incorporated. Recurrent Neural Networks are known for their autoregressive properties and are frequently used in language modelling with direct application to SMILES generation. The relatively novel TCNs possess similar properties with wide receptive field while obeying the causality required for natural language processing (NLP). The combination of both latent representations expressed through TCN and RNN results in an overall better performance compared to RNN alone. Additionally, it is shown that different fine-tuning protocols have a profound impact on generative scope of the model when applied on a dataset of interest via transfer learning.
    Enhancing Graph Neural Networks with Structure-Based Prompt. (arXiv:2310.17394v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful in learning semantics of graph data. Recently, a new paradigm "pre-train, prompt" has shown promising results in adapting GNNs to various tasks with less supervised data. The success of such paradigm can be attributed to the more consistent objectives of pre-training and task-oriented prompt tuning, where the pre-trained knowledge can be effectively transferred to downstream tasks. However, an overlooked issue of existing studies is that the structure information of graph is usually exploited during pre-training for learning node representations, while neglected in the prompt tuning stage for learning task-specific parameters. To bridge this gap, we propose a novel structure-based prompting method for GNNs, namely SAP, which consistently exploits structure information in both pre-training and prompt tuning stages. In particular, SAP 1) employs a dual-view contrastive learning to align the latent semantic spaces of node attributes and graph structure, and 2) incorporates structure information in prompted graph to elicit more pre-trained knowledge in prompt tuning. We conduct extensive experiments on node classification and graph classification tasks to show the effectiveness of SAP. Moreover, we show that SAP can lead to better performance in more challenging few-shot scenarios on both homophilous and heterophilous graphs.
    Bayesian Neural Controlled Differential Equations for Treatment Effect Estimation. (arXiv:2310.17463v1 [cs.LG])
    Treatment effect estimation in continuous time is crucial for personalized medicine. However, existing methods for this task are limited to point estimates of the potential outcomes, whereas uncertainty estimates have been ignored. Needless to say, uncertainty quantification is crucial for reliable decision-making in medical applications. To fill this gap, we propose a novel Bayesian neural controlled differential equation (BNCDE) for treatment effect estimation in continuous time. In our BNCDE, the time dimension is modeled through a coupled system of neural controlled differential equations and neural stochastic differential equations, where the neural stochastic differential equations allow for tractable variational Bayesian inference. Thereby, for an assigned sequence of treatments, our BNCDE provides meaningful posterior predictive distributions of the potential outcomes. To the best of our knowledge, ours is the first tailored neural method to provide uncertainty estimates of treatment effects in continuous time. As such, our method is of direct practical value for promoting reliable decision-making in medicine.
    Detection Defenses: An Empty Promise against Adversarial Patch Attacks on Optical Flow. (arXiv:2310.17403v1 [cs.CV])
    Adversarial patches undermine the reliability of optical flow predictions when placed in arbitrary scene locations. Therefore, they pose a realistic threat to real-world motion detection and its downstream applications. Potential remedies are defense strategies that detect and remove adversarial patches, but their influence on the underlying motion prediction has not been investigated. In this paper, we thoroughly examine the currently available detect-and-remove defenses ILP and LGS for a wide selection of state-of-the-art optical flow methods, and illuminate their side effects on the quality and robustness of the final flow predictions. In particular, we implement defense-aware attacks to investigate whether current defenses are able to withstand attacks that take the defense mechanism into account. Our experiments yield two surprising results: Detect-and-remove defenses do not only lower the optical flow quality on benign scenes, in doing so, they also harm the robustness under patch attacks for all tested optical flow methods except FlowNetC. As currently employed detect-and-remove defenses fail to deliver the promised adversarial robustness for optical flow, they evoke a false sense of security. The code is available at https://github.com/cv-stuttgart/DetectionDefenses.
    Likelihood-based Out-of-Distribution Detection with Denoising Diffusion Probabilistic Models. (arXiv:2310.17432v1 [cs.LG])
    Out-of-Distribution detection between dataset pairs has been extensively explored with generative models. We show that likelihood-based Out-of-Distribution detection can be extended to diffusion models by leveraging the fact that they, like other likelihood-based generative models, are dramatically affected by the input sample complexity. Currently, all Out-of-Distribution detection methods with Diffusion Models are reconstruction-based. We propose a new likelihood ratio for Out-of-Distribution detection with Deep Denoising Diffusion Models, which we call the Complexity Corrected Likelihood Ratio. Our likelihood ratio is constructed using Evidence Lower-Bound evaluations from an individual model at various noising levels. We present results that are comparable to state-of-the-art Out-of-Distribution detection methods with generative models.
    Cross-modal Active Complementary Learning with Self-refining Correspondence. (arXiv:2310.17468v1 [cs.CV])
    Recently, image-text matching has attracted more and more attention from academia and industry, which is fundamental to understanding the latent correspondence across visual and textual modalities. However, most existing methods implicitly assume the training pairs are well-aligned while ignoring the ubiquitous annotation noise, a.k.a noisy correspondence (NC), thereby inevitably leading to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorizing/overfitting and unreliable correction for NC, especially under high noise. To address the two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences.
    Causal Modeling with Stationary Diffusions. (arXiv:2310.17405v1 [cs.LG])
    We develop a novel approach towards causal inference. Rather than structural equations over a causal graph, we learn stochastic differential equations (SDEs) whose stationary densities model a system's behavior under interventions. These stationary diffusion models do not require the formalism of causal graphs, let alone the common assumption of acyclicity. We show that in several cases, they generalize to unseen interventions on their variables, often better than classical approaches. Our inference method is based on a new theoretical result that expresses a stationarity condition on the diffusion's generator in a reproducing kernel Hilbert space. The resulting kernel deviation from stationarity (KDS) is an objective function of independent interest.
    Towards Learning Monocular 3D Object Localization From 2D Labels using the Physical Laws of Motion. (arXiv:2310.17462v1 [cs.CV])
    We present a novel method for precise 3D object localization in single images from a single calibrated camera using only 2D labels. No expensive 3D labels are needed. Thus, instead of using 3D labels, our model is trained with easy-to-annotate 2D labels along with the physical knowledge of the object's motion. Given this information, the model can infer the latent third dimension, even though it has never seen this information during training. Our method is evaluated on both synthetic and real-world datasets, and we are able to achieve a mean distance error of just 6 cm in our experiments on real data. The results indicate the method's potential as a step towards learning 3D object location estimation, where collecting 3D data for training is not feasible.
    On the recognition of the game type based on physiological signals and eye tracking. (arXiv:2310.17383v1 [cs.LG])
    Automated interpretation of signals yields many impressive applications from the area of affective computing and human activity recognition (HAR). In this paper we ask the question about possibility of cognitive activity recognition on the base of particular set of signals. We use recognition of the game played by the participant as a playground for exploration of the problem. We build classifier of three different games (Space Invaders, Tetris, Tower Defence) and inter-game pause. We validate classifier in the player-independent and player-dependent scenario. We discuss the improvement in the player-dependent scenario in the context of biometric person recognition. On the base of the results obtained in game classification, we consider potential applications in smart surveillance and quantified self.
    Tackling Interference Induced by Data Training Loops in A/B Tests: A Weighted Training Approach. (arXiv:2310.17496v1 [stat.ME])
    In modern recommendation systems, the standard pipeline involves training machine learning models on historical data to predict user behaviors and improve recommendations continuously. However, these data training loops can introduce interference in A/B tests, where data generated by control and treatment algorithms, potentially with different distributions, are combined. To address these challenges, we introduce a novel approach called weighted training. This approach entails training a model to predict the probability of each data point appearing in either the treatment or control data and subsequently applying weighted losses during model training. We demonstrate that this approach achieves the least variance among all estimators without causing shifts in the training distributions. Through simulation studies, we demonstrate the lower bias and variance of our approach compared to other methods.
    The IMS Toucan System for the Blizzard Challenge 2023. (arXiv:2310.17499v1 [cs.CL])
    For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes to spectrograms as intermediate representations using a fast and efficient non-autoregressive synthesis architecture based on Conformer and Glow. A GAN based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram to the final wave. We carefully designed the data processing, training, and inference procedures for the challenge data. Our system identifier is G. Open source code and demo are available.
    Handshape recognition for Argentinian Sign Language using ProbSom. (arXiv:2310.17427v1 [cs.CV])
    Automatic sign language recognition is an important topic within the areas of human-computer interaction and machine learning. On the one hand, it poses a complex challenge that requires the intervention of various knowledge areas, such as video processing, image processing, intelligent systems and linguistics. On the other hand, robust recognition of sign language could assist in the translation process and the integration of hearing-impaired people. This paper offers two main contributions: first, the creation of a database of handshapes for the Argentinian Sign Language (LSA), which is a topic that has barely been discussed so far. Secondly, a technique for image processing, descriptor extraction and subsequent handshape classification using a supervised adaptation of self-organizing maps that is called ProbSom. This technique is compared to others in the state of the art, such as Support Vector Machines (SVM), Random Forests, and Neural Networks. The database that was built contains 800 images with 16 LSA handshapes, and is a first step towards building a comprehensive database of Argentinian signs. The ProbSom-based neural classifier, using the proposed descriptor, achieved an accuracy rate above 90%.
    Learning Regularized Graphon Mean-Field Games with Unknown Graphons. (arXiv:2310.17531v1 [cs.GT])
    We design and analyze reinforcement learning algorithms for Graphon Mean-Field Games (GMFGs). In contrast to previous works that require the precise values of the graphons, we aim to learn the Nash Equilibrium (NE) of the regularized GMFGs when the graphons are unknown. Our contributions are threefold. First, we propose the Proximal Policy Optimization for GMFG (GMFG-PPO) algorithm and show that it converges at a rate of $O(T^{-1/3})$ after $T$ iterations with an estimation oracle, improving on a previous work by Xie et al. (ICML, 2021). Second, using kernel embedding of distributions, we design efficient algorithms to estimate the transition kernels, reward functions, and graphons from sampled agents. Convergence rates are then derived when the positions of the agents are either known or unknown. Results for the combination of the optimization algorithm GMFG-PPO and the estimation algorithm are then provided. These algorithms are the first specifically designed for learning graphons from sampled agents. Finally, the efficacy of the proposed algorithms are corroborated through simulations. These simulations demonstrate that learning the unknown graphons reduces the exploitability effectively.
    Interactive Robot Learning from Verbal Correction. (arXiv:2310.17555v1 [cs.RO])
    The ability to learn and refine behavior after deployment has become ever more important for robots as we design them to operate in unstructured environments like households. In this work, we design a new learning system based on large language model (LLM), OLAF, that allows everyday users to teach a robot using verbal corrections when the robot makes mistakes, e.g., by saying "Stop what you're doing. You should move closer to the cup." A key feature of OLAF is its ability to update the robot's visuomotor neural policy based on the verbal feedback to avoid repeating mistakes in the future. This is in contrast to existing LLM-based robotic systems, which only follow verbal commands or corrections but not learn from them. We demonstrate the efficacy of our design in experiments where a user teaches a robot to perform long-horizon manipulation tasks both in simulation and on physical hardware, achieving on average 20.0% improvement in policy success rate. Videos and more results are at https://ut-austin-rpl.github.io/olaf/
    Invariance Measures for Neural Networks. (arXiv:2310.17404v1 [cs.LG])
    Invariances in neural networks are useful and necessary for many tasks. However, the representation of the invariance of most neural network models has not been characterized. We propose measures to quantify the invariance of neural networks in terms of their internal representation. The measures are efficient and interpretable, and can be applied to any neural network model. They are also more sensitive to invariance than previously defined measures. We validate the measures and their properties in the domain of affine transformations and the CIFAR10 and MNIST datasets, including their stability and interpretability. Using the measures, we perform a first analysis of CNN models and show that their internal invariance is remarkably stable to random weight initializations, but not to changes in dataset or transformation. We believe the measures will enable new avenues of research in invariance representation.
    The Expressive Power of Low-Rank Adaptation. (arXiv:2310.17513v1 [cs.LG])
    Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of }\overline{f}}{\text{depth of }f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters.
    Coalitional Bargaining via Reinforcement Learning: An Application to Collaborative Vehicle Routing. (arXiv:2310.17458v1 [cs.LG])
    Collaborative Vehicle Routing is where delivery companies cooperate by sharing their delivery information and performing delivery requests on behalf of each other. This achieves economies of scale and thus reduces cost, greenhouse gas emissions, and road congestion. But which company should partner with whom, and how much should each company be compensated? Traditional game theoretic solution concepts, such as the Shapley value or nucleolus, are difficult to calculate for the real-world problem of Collaborative Vehicle Routing due to the characteristic function scaling exponentially with the number of agents. This would require solving the Vehicle Routing Problem (an NP-Hard problem) an exponential number of times. We therefore propose to model this problem as a coalitional bargaining game where - crucially - agents are not given access to the characteristic function. Instead, we implicitly reason about the characteristic function, and thus eliminate the need to evaluate the VRP an exponential number of times - we only need to evaluate it once. Our contribution is that our decentralised approach is both scalable and considers the self-interested nature of companies. The agents learn using a modified Independent Proximal Policy Optimisation. Our RL agents outperform a strong heuristic bot. The agents correctly identify the optimal coalitions 79% of the time with an average optimality gap of 4.2% and reduction in run-time of 62%.
    CBD: A Certified Backdoor Detector Based on Local Dominant Probability. (arXiv:2310.17498v1 [cs.LG])
    Backdoor attack is a common threat to deep neural networks. During testing, samples embedded with a backdoor trigger will be misclassified as an adversarial target by a backdoored model, while samples without the backdoor trigger will be correctly classified. In this paper, we present the first certified backdoor detector (CBD), which is based on a novel, adjustable conformal prediction scheme based on our proposed statistic local dominant probability. For any classifier under inspection, CBD provides 1) a detection inference, 2) the condition under which the attacks are guaranteed to be detectable for the same classification domain, and 3) a probabilistic upper bound for the false positive rate. Our theoretical results show that attacks with triggers that are more resilient to test-time noise and have smaller perturbation magnitudes are more likely to be detected with guarantees. Moreover, we conduct extensive experiments on four benchmark datasets considering various backdoor types, such as BadNet, CB, and Blend. CBD achieves comparable or even higher detection accuracy than state-of-the-art detectors, and it in addition provides detection certification. Notably, for backdoor attacks with random perturbation triggers bounded by $\ell_2\leq0.75$ which achieves more than 90\% attack success rate, CBD achieves 100\% (98\%), 100\% (84\%), 98\% (98\%), and 72\% (40\%) empirical (certified) detection true positive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and TinyImageNet, respectively, with low false positive rates.
    Hierarchical Ensemble-Based Feature Selection for Time Series Forecasting. (arXiv:2310.17544v1 [cs.LG])
    We study a novel ensemble approach for feature selection based on hierarchical stacking in cases of non-stationarity and limited number of samples with large number of features. Our approach exploits the co-dependency between features using a hierarchical structure. Initially, a machine learning model is trained using a subset of features, and then the model's output is updated using another algorithm with the remaining features to minimize the target loss. This hierarchical structure allows for flexible depth and feature selection. By exploiting feature co-dependency hierarchically, our proposed approach overcomes the limitations of traditional feature selection methods and feature importance scores. The effectiveness of the approach is demonstrated on synthetic and real-life datasets, indicating improved performance with scalability and stability compared to the traditional methods and state-of-the-art approaches.
    A Challenge in Reweighting Data with Bilevel Optimization. (arXiv:2310.17386v1 [stat.ML])
    In many scenarios, one uses a large training set to train a model with the goal of performing well on a smaller testing set with a different distribution. Learning a weight for each data point of the training set is an appealing solution, as it ideally allows one to automatically learn the importance of each training point for generalization on the testing set. This task is usually formalized as a bilevel optimization problem. Classical bilevel solvers are based on a warm-start strategy where both the parameters of the models and the data weights are learned at the same time. We show that this joint dynamic may lead to sub-optimal solutions, for which the final data weights are very sparse. This finding illustrates the difficulty of data reweighting and offers a clue as to why this method is rarely used in practice.
    Foundation Model Based Native AI Framework in 6G with Cloud-Edge-End Collaboration. (arXiv:2310.17471v1 [cs.IT])
    Future wireless communication networks are in a position to move beyond data-centric, device-oriented connectivity and offer intelligent, immersive experiences based on task-oriented connections, especially in the context of the thriving development of pre-trained foundation models (PFM) and the evolving vision of 6G native artificial intelligence (AI). Therefore, redefining modes of collaboration between devices and servers and constructing native intelligence libraries become critically important in 6G. In this paper, we analyze the challenges of achieving 6G native AI from the perspectives of data, intelligence, and networks. Then, we propose a 6G native AI framework based on foundation models, provide a customization approach for intent-aware PFM, present a construction of a task-oriented AI toolkit, and outline a novel cloud-edge-end collaboration paradigm. As a practical use case, we apply this framework for orchestration, achieving the maximum sum rate within a wireless communication system, and presenting preliminary evaluation results. Finally, we outline research directions for achieving native AI in 6G.
    Fair collaborative vehicle routing: A deep multi-agent reinforcement learning approach. (arXiv:2310.17485v1 [cs.LG])
    Collaborative vehicle routing occurs when carriers collaborate through sharing their transportation requests and performing transportation requests on behalf of each other. This achieves economies of scale, thus reducing cost, greenhouse gas emissions and road congestion. But which carrier should partner with whom, and how much should each carrier be compensated? Traditional game theoretic solution concepts are expensive to calculate as the characteristic function scales exponentially with the number of agents. This would require solving the vehicle routing problem (NP-hard) an exponential number of times. We therefore propose to model this problem as a coalitional bargaining game solved using deep multi-agent reinforcement learning, where - crucially - agents are not given access to the characteristic function. Instead, we implicitly reason about the characteristic function; thus, when deployed in production, we only need to evaluate the expensive post-collaboration vehicle routing problem once. Our contribution is that we are the first to consider both the route allocation problem and gain sharing problem simultaneously - without access to the expensive characteristic function. Through decentralised machine learning, our agents bargain with each other and agree to outcomes that correlate well with the Shapley value - a fair profit allocation mechanism. Importantly, we are able to achieve a reduction in run-time of 88%.
    Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates. (arXiv:2310.17074v1 [cs.LG])
    In this work, we theoretically investigate the generalization properties of neural networks (NN) trained by stochastic gradient descent (SGD) algorithm with large learning rates. Under such a training regime, our finding is that, the oscillation of the NN weights caused by the large learning rate SGD training turns out to be beneficial to the generalization of the NN, which potentially improves over the same NN trained by SGD with small learning rates that converges more smoothly. In view of this finding, we call such a phenomenon "benign oscillation". Our theory towards demystifying such a phenomenon builds upon the feature learning perspective of deep learning. Specifically, we consider a feature-noise data generation model that consists of (i) weak features which have a small $\ell_2$-norm and appear in each data point; (ii) strong features which have a larger $\ell_2$-norm but only appear in a certain fraction of all data points; and (iii) noise. We prove that NNs trained by oscillating SGD with a large learning rate can effectively learn the weak features in the presence of those strong features. In contrast, NNs trained by SGD with a small learning rate can only learn the strong features but makes little progress in learning the weak features. Consequently, when it comes to the new testing data which consist of only weak features, the NN trained by oscillating SGD with a large learning rate could still make correct predictions consistently, while the NN trained by small learning rate SGD fails. Our theory sheds light on how large learning rate training benefits the generalization of NNs. Experimental results demonstrate our finding on "benign oscillation".
    Demonstration-Regularized RL. (arXiv:2310.17303v1 [stat.ML])
    Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using $N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal policy at a sample complexity of order $\widetilde{\mathcal{O}}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{\mathcal{O}}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes, where $\varepsilon$ is the target precision, $H$ the horizon, $A$ the number of action, $S$ the number of states in the finite case and $d$ the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works.
    Weakly-Supervised Surgical Phase Recognition. (arXiv:2310.17209v1 [cs.CV])
    A key element of computer-assisted surgery systems is phase recognition of surgical videos. Existing phase recognition algorithms require frame-wise annotation of a large number of videos, which is time and money consuming. In this work we join concepts of graph segmentation with self-supervised learning to derive a random-walk solution for per-frame phase prediction. Furthermore, we utilize within our method two forms of weak supervision: sparse timestamps or few-shot learning. The proposed algorithm enjoys low complexity and can operate in lowdata regimes. We validate our method by running experiments with the public Cholec80 dataset of laparoscopic cholecystectomy videos, demonstrating promising performance in multiple setups.
    On Forecast Stability. (arXiv:2310.17332v1 [cs.LG])
    Forecasts are typically not produced in a vacuum but in a business context, where forecasts are generated on a regular basis and interact with each other. For decisions, it may be important that forecasts do not change arbitrarily, and are stable in some sense. However, this area has received only limited attention in the forecasting literature. In this paper, we explore two types of forecast stability that we call vertical stability and horizontal stability. The existing works in the literature are only applicable to certain base models and extending these frameworks to be compatible with any base model is not straightforward. Furthermore, these frameworks can only stabilise the forecasts vertically. To fill this gap, we propose a simple linear-interpolation-based approach that is applicable to stabilise the forecasts provided by any base model vertically and horizontally. The approach can produce both accurate and stable forecasts. Using N-BEATS, Pooled Regression and LightGBM as the base models, in our evaluation on four publicly available datasets, the proposed framework is able to achieve significantly higher stability and/or accuracy compared to a set of benchmarks including a state-of-the-art forecast stabilisation method across three error metrics and six stability metrics.
    fairret: a Framework for Differentiable Fairness Regularization Terms. (arXiv:2310.17256v1 [cs.LG])
    Current tools for machine learning fairness only admit a limited range of fairness definitions and have seen little integration with automatic differentiation libraries, despite the central role these libraries play in modern machine learning pipelines. We introduce a framework of fairness regularization terms (fairrets) which quantify bias as modular objectives that are easily integrated in automatic differentiation pipelines. By employing a general definition of fairness in terms of linear-fractional statistics, a wide class of fairrets can be computed efficiently. Experiments show the behavior of their gradients and their utility in enforcing fairness with minimal loss of predictive power compared to baselines. Our contribution includes a PyTorch implementation of the fairret framework.
    Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise. (arXiv:2310.17167v1 [cs.LG])
    This paper introduces two key contributions aimed at improving the speed and quality of images generated through inverse diffusion processes. The first contribution involves reparameterizing the diffusion process in terms of the angle on a quarter-circular arc between the image and noise, specifically setting the conventional $\displaystyle \sqrt{\bar{\alpha}}=\cos(\eta)$. This reparameterization eliminates two singularities and allows for the expression of diffusion evolution as a well-behaved ordinary differential equation (ODE). In turn, this allows higher order ODE solvers such as Runge-Kutta methods to be used effectively. The second contribution is to directly estimate both the image ($\mathbf{x}_0$) and noise ($\mathbf{\epsilon}$) using our network, which enables more stable calculations of the update step in the inverse diffusion steps, as accurate estimation of both the image and noise are crucial at different stages of the process. Together with these changes, our model achieves faster generation, with the ability to converge on high-quality images more quickly, and higher quality of the generated images, as measured by metrics such as Frechet Inception Distance (FID), spatial Frechet Inception Distance (sFID), precision, and recall.
    C-Disentanglement: Discovering Causally-Independent Generative Factors under an Inductive Bias of Confounder. (arXiv:2310.17325v1 [cs.LG])
    Representation learning assumes that real-world data is generated by a few semantically meaningful generative factors (i.e., sources of variation) and aims to discover them in the latent space. These factors are expected to be causally disentangled, meaning that distinct factors are encoded into separate latent variables, and changes in one factor will not affect the values of the others. Compared to statistical independence, causal disentanglement allows more controllable data generation, improved robustness, and better generalization. However, most existing work assumes unconfoundedness in the discovery process, that there are no common causes to the generative factors and thus obtain only statistical independence. In this paper, we recognize the importance of modeling confounders in discovering causal generative factors. Unfortunately, such factors are not identifiable without proper inductive bias. We fill the gap by introducing a framework entitled Confounded-Disentanglement (C-Disentanglement), the first framework that explicitly introduces the inductive bias of confounder via labels from domain expertise. In addition, we accordingly propose an approach to sufficiently identify the causally disentangled factors under any inductive bias of the confounder. We conduct extensive experiments on both synthetic and real-world datasets. Our method demonstrates competitive results compared to various SOTA baselines in obtaining causally disentangled features and downstream tasks under domain shifts.
    CQM: Curriculum Reinforcement Learning with a Quantized World Model. (arXiv:2310.17330v1 [cs.LG])
    Recent curriculum Reinforcement Learning (RL) has shown notable progress in solving complex tasks by proposing sequences of surrogate tasks. However, the previous approaches often face challenges when they generate curriculum goals in a high-dimensional space. Thus, they usually rely on manually specified goal spaces. To alleviate this limitation and improve the scalability of the curriculum, we propose a novel curriculum method that automatically defines the semantic goal space which contains vital information for the curriculum process, and suggests curriculum goals over it. To define the semantic goal space, our method discretizes continuous observations via vector quantized-variational autoencoders (VQ-VAE) and restores the temporal relations between the discretized observations by a graph. Concurrently, ours suggests uncertainty and temporal distance-aware curriculum goals that converges to the final goals over the automatically composed goal space. We demonstrate that the proposed method allows efficient explorations in an uninformed environment with raw goal examples only. Also, ours outperforms the state-of-the-art curriculum RL methods on data efficiency and performance, in various goal-reaching tasks even with ego-centric visual inputs.
    How do Language Models Bind Entities in Context?. (arXiv:2310.17191v1 [cs.LG])
    To correctly use in-context information, language models (LMs) must bind entities to their attributes. For example, given a context describing a "green square" and a "blue circle", LMs must bind the shapes to their respective colors. We analyze LM representations and identify the binding ID mechanism: a general mechanism for solving the binding problem, which we observe in every sufficiently large model from the Pythia and LLaMA families. Using causal interventions, we show that LMs' internal activations represent binding information by attaching binding ID vectors to corresponding entities and attributes. We further show that binding ID vectors form a continuous subspace, in which distances between binding ID vectors reflect their discernability. Overall, our results uncover interpretable strategies in LMs for representing symbolic knowledge in-context, providing a step towards understanding general in-context reasoning in large-scale LMs.
    A multi-artifact EEG denoising by frequency-based deep learning. (arXiv:2310.17335v1 [cs.LG])
    Electroencephalographic (EEG) signals are fundamental to neuroscience research and clinical applications such as brain-computer interfaces and neurological disorder diagnosis. These signals are typically a combination of neurological activity and noise, originating from various sources, including physiological artifacts like ocular and muscular movements. Under this setting, we tackle the challenge of distinguishing neurological activity from noise-related sources. We develop a novel EEG denoising model that operates in the frequency domain, leveraging prior knowledge about noise spectral features to adaptively compute optimal convolutional filters for noise separation. The model is trained to learn an empirical relationship connecting the spectral characteristics of noise and noisy signal to a non-linear transformation which allows signal denoising. Performance evaluation on the EEGdenoiseNet dataset shows that the proposed model achieves optimal results according to both temporal and spectral metrics. The model is found to remove physiological artifacts from input EEG data, thus achieving effective EEG denoising. Indeed, the model performance either matches or outperforms that achieved by benchmark models, proving to effectively remove both muscle and ocular artifacts without the need to perform any training on the particular type of artifact.
    DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics. (arXiv:2310.13268v2 [cs.CV] UPDATED)
    Diffusion probabilistic models (DPMs) have exhibited excellent performance for high-fidelity image generation while suffering from inefficient sampling. Recent works accelerate the sampling procedure by proposing fast ODE solvers that leverage the specific ODE form of DPMs. However, they highly rely on specific parameterization during inference (such as noise/data prediction), which might not be the optimal choice. In this work, we propose a novel formulation towards the optimal parameterization during sampling that minimizes the first-order discretization error of the ODE solution. Based on such formulation, we propose \textit{DPM-Solver-v3}, a new fast ODE solver for DPMs by introducing several coefficients efficiently computed on the pretrained model, which we call \textit{empirical model statistics}. We further incorporate multistep methods and a predictor-corrector framework, and propose some techniques for improving sample quality at small numbers of function evaluations (NFE) or large guidance scales. Experiments show that DPM-Solver-v3 achieves consistently better or comparable performance in both unconditional and conditional sampling with both pixel-space and latent-space DPMs, especially in 5$\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable Diffusion, bringing a speed-up of 15\%$\sim$30\% compared to previous state-of-the-art training-free methods. Code is available at \url{https://github.com/thu-ml/DPM-Solver-v3}.
    Graphical Object-Centric Actor-Critic. (arXiv:2310.17178v1 [cs.AI])
    There have recently been significant advances in the problem of unsupervised object-centric representation learning and its application to downstream tasks. The latest works support the argument that employing disentangled object representations in image-based object-centric reinforcement learning tasks facilitates policy learning. We propose a novel object-centric reinforcement learning algorithm combining actor-critic and model-based approaches to utilize these representations effectively. In our approach, we use a transformer encoder to extract object representations and graph neural networks to approximate the dynamics of an environment. The proposed method fills a research gap in developing efficient object-centric world models for reinforcement learning settings that can be used for environments with discrete or continuous action spaces. Our algorithm performs better in a visually complex 3D robotic environment and a 2D environment with compositional structure than the state-of-the-art model-free actor-critic algorithm built upon transformer architecture and the state-of-the-art monolithic model-based algorithm.
    CROP: Conservative Reward for Model-based Offline Policy Optimization. (arXiv:2310.17245v1 [cs.LG])
    Offline reinforcement learning (RL) aims to optimize policy using collected data without online interactions. Model-based approaches are particularly appealing for addressing offline RL challenges due to their capability to mitigate the limitations of offline data through data generation using models. Prior research has demonstrated that introducing conservatism into the model or Q-function during policy optimization can effectively alleviate the prevalent distribution drift problem in offline RL. However, the investigation into the impacts of conservatism in reward estimation is still lacking. This paper proposes a novel model-based offline RL algorithm, Conservative Reward for model-based Offline Policy optimization (CROP), which conservatively estimates the reward in model training. To achieve a conservative reward estimation, CROP simultaneously minimizes the estimation error and the reward of random actions. Theoretical analysis shows that this conservative reward mechanism leads to a conservative policy evaluation and helps mitigate distribution drift. Experiments on D4RL benchmarks showcase that the performance of CROP is comparable to the state-of-the-art baselines. Notably, CROP establishes an innovative connection between offline and online RL, highlighting that offline RL problems can be tackled by adopting online RL techniques to the empirical Markov decision process trained with a conservative reward. The source code is available with https://github.com/G0K0URURI/CROP.git.
    Multi-scale Diffusion Denoised Smoothing. (arXiv:2310.16779v2 [cs.LG] UPDATED)
    Along with recent diffusion models, randomized smoothing has become one of a few tangible approaches that offers adversarial robustness to models at scale, e.g., those of large pre-trained models. Specifically, one can perform randomized smoothing on any classifier via a simple "denoise-and-classify" pipeline, so-called denoised smoothing, given that an accurate denoiser is available - such as diffusion model. In this paper, we present scalable methods to address the current trade-off between certified robustness and accuracy in denoised smoothing. Our key idea is to "selectively" apply smoothing among multiple noise scales, coined multi-scale smoothing, which can be efficiently implemented with a single diffusion model. This approach also suggests a new objective to compare the collective robustness of multi-scale smoothed classifiers, and questions which representation of diffusion model would maximize the objective. To address this, we propose to further fine-tune diffusion model (a) to perform consistent denoising whenever the original image is recoverable, but (b) to generate rather diverse outputs otherwise. Our experiments show that the proposed multi-scale smoothing scheme combined with diffusion fine-tuning enables strong certified robustness available with high noise level while maintaining its accuracy closer to non-smoothed classifiers.
    Network Design through Graph Neural Networks: Identifying Challenges and Improving Performance. (arXiv:2310.17100v1 [cs.LG])
    Graph Neural Network (GNN) research has produced strategies to modify a graph's edges using gradients from a trained GNN, with the goal of network design. However, the factors which govern gradient-based editing are understudied, obscuring why edges are chosen and if edits are grounded in an edge's importance. Thus, we begin by analyzing the gradient computation in previous works, elucidating the factors that influence edits and highlighting the potential over-reliance on structural properties. Specifically, we find that edges can achieve high gradients due to structural biases, rather than importance, leading to erroneous edits when the factors are unrelated to the design task. To improve editing, we propose ORE, an iterative editing method that (a) edits the highest scoring edges and (b) re-embeds the edited graph to refresh gradients, leading to less biased edge choices. We empirically study ORE through a set of proposed design tasks, each with an external validation method, demonstrating that ORE improves upon previous methods by up to 50%.
    Technical Note: Feasibility of translating 3.0T-trained Deep-Learning Segmentation Models Out-of-the-Box on Low-Field MRI 0.55T Knee-MRI of Healthy Controls. (arXiv:2310.17152v1 [cs.CV])
    In the current study, our purpose is to evaluate the feasibility of applying deep learning (DL) enabled algorithms to quantify bilateral knee biomarkers in healthy controls scanned at 0.55T and compared with 3.0T. The current study assesses the performance of standard in-practice bone, and cartilage segmentation algorithms at 0.55T, both qualitatively and quantitatively, in terms of comparing segmentation performance, areas of improvement, and compartment-wise cartilage thickness values between 0.55T vs. 3.0T. Initial results demonstrate a usable to good technical feasibility of translating existing quantitative deep-learning-based image segmentation techniques, trained on 3.0T, out of 0.55T for knee MRI, in a multi-vendor acquisition environment. Especially in terms of segmenting cartilage compartments, the models perform almost equivalent to 3.0T in terms of Likert ranking. The 0.55T low-field sustainable and easy-to-install MRI, as demonstrated, thus, can be utilized for evaluating knee cartilage thickness and bone segmentations aided by established DL algorithms trained at higher-field strengths out-of-the-box initially. This could be utilized at the far-spread point-of-care locations with a lack of radiologists available to manually segment low-field images, at least till a decent base of low-field data pool is collated. With further fine-tuning with manual labeling of low-field data or utilizing synthesized higher SNR images from low-field images, OA biomarker quantification performance is potentially guaranteed to be further improved.
    PID-Inspired Inductive Biases for Deep Reinforcement Learning in Partially Observable Control Tasks. (arXiv:2307.05891v2 [cs.LG] UPDATED)
    Deep reinforcement learning (RL) has shown immense potential for learning to control systems through data alone. However, one challenge deep RL faces is that the full state of the system is often not observable. When this is the case, the policy needs to leverage the history of observations to infer the current state. At the same time, differences between the training and testing environments makes it critical for the policy not to overfit to the sequence of observations it sees at training time. As such, there is an important balancing act between having the history encoder be flexible enough to extract relevant information, yet be robust to changes in the environment. To strike this balance, we look to the PID controller for inspiration. We assert the PID controller's success shows that only summing and differencing are needed to accumulate information over time for many control tasks. Following this principle, we propose two architectures for encoding history: one that directly uses PID features and another that extends these core ideas and can be used in arbitrary control tasks. When compared with prior approaches, our encoders produce policies that are often more robust and achieve better performance on a variety of tracking tasks. Going beyond tracking tasks, our policies achieve 1.7x better performance on average over previous state-of-the-art methods on a suite of locomotion control tasks.
    TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023. (arXiv:2307.14338v2 [cs.LG] UPDATED)
    Deep learning (DL) models for tabular data problems (e.g. classification, regression) are currently receiving increasingly more attention from researchers. However, despite the recent efforts, the non-DL algorithms based on gradient-boosted decision trees (GBDT) remain a strong go-to solution for these problems. One of the research directions aimed at improving the position of tabular DL involves designing so-called retrieval-augmented models. For a target object, such models retrieve other objects (e.g. the nearest neighbors) from the available training data and use their features and labels to make a better prediction. In this work, we present TabR -- essentially, a feed-forward network with a custom k-Nearest-Neighbors-like component in the middle. On a set of public benchmarks with datasets up to several million objects, TabR marks a big step forward for tabular DL: it demonstrates the best average performance among tabular DL models, becomes the new state-of-the-art on several datasets, and even outperforms GBDT models on the recently proposed "GBDT-friendly" benchmark (see Figure 1). Among the important findings and technical details powering TabR, the main ones lie in the attention-like mechanism that is responsible for retrieving the nearest neighbors and extracting valuable signal from them. In addition to the much higher performance, TabR is simple and significantly more efficient compared to prior retrieval-based tabular DL models.
    DSAC-C: Constrained Maximum Entropy for Robust Discrete Soft-Actor Critic. (arXiv:2310.17173v1 [cs.LG])
    We present a novel extension to the family of Soft Actor-Critic (SAC) algorithms. We argue that based on the Maximum Entropy Principle, discrete SAC can be further improved via additional statistical constraints derived from a surrogate critic policy. Furthermore, our findings suggests that these constraints provide an added robustness against potential domain shifts, which are essential for safe deployment of reinforcement learning agents in the real-world. We provide theoretical analysis and show empirical results on low data regimes for both in-distribution and out-of-distribution variants of Atari 2600 games.
    miditok: A Python package for MIDI file tokenization. (arXiv:2310.17202v1 [cs.LG])
    Recent progress in natural language processing has been adapted to the symbolic music modality. Language models, such as Transformers, have been used with symbolic music for a variety of tasks among which music generation, modeling or transcription, with state-of-the-art performances. These models are beginning to be used in production products. To encode and decode music for the backbone model, they need to rely on tokenizers, whose role is to serialize music into sequences of distinct elements called tokens. MidiTok is an open-source library allowing to tokenize symbolic music with great flexibility and extended features. It features the most popular music tokenizations, under a unified API. It is made to be easily used and extensible for everyone.
    A Neural Collapse Perspective on Feature Evolution in Graph Neural Networks. (arXiv:2307.01951v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have become increasingly popular for classification tasks on graph-structured data. Yet, the interplay between graph topology and feature evolution in GNNs is not well understood. In this paper, we focus on node-wise classification, illustrated with community detection on stochastic block model graphs, and explore the feature evolution through the lens of the "Neural Collapse" (NC) phenomenon. When training instance-wise deep classifiers (e.g. for image classification) beyond the zero training error point, NC demonstrates a reduction in the deepest features' within-class variability and an increased alignment of their class means to certain symmetric structures. We start with an empirical study that shows that a decrease in within-class variability is also prevalent in the node-wise classification setting, however, not to the extent observed in the instance-wise case. Then, we theoretically study this distinction. Specifically, we show that even an "optimistic" mathematical model requires that the graphs obey a strict structural condition in order to possess a minimizer with exact collapse. Interestingly, this condition is viable also for heterophilic graphs and relates to recent empirical studies on settings with improved GNNs' generalization. Furthermore, by studying the gradient dynamics of the theoretical model, we provide reasoning for the partial collapse observed empirically. Finally, we present a study on the evolution of within- and between-class feature variability across layers of a well-trained GNN and contrast the behavior with spectral methods.
    HCT: Hybrid Convnet-Transformer for Parkinson's disease detection and severity prediction from gait. (arXiv:2310.17078v1 [cs.CV])
    In this paper, we propose a novel deep learning method based on a new Hybrid ConvNet-Transformer architecture to detect and stage Parkinson's disease (PD) from gait data. We adopt a two-step approach by dividing the problem into two sub-problems. Our Hybrid ConvNet-Transformer model first distinguishes healthy versus parkinsonian patients. If the patient is parkinsonian, a multi-class Hybrid ConvNet-Transformer model determines the Hoehn and Yahr (H&Y) score to assess the PD severity stage. Our hybrid architecture exploits the strengths of both Convolutional Neural Networks (ConvNets) and Transformers to accurately detect PD and determine the severity stage. In particular, we take advantage of ConvNets to capture local patterns and correlations in the data, while we exploit Transformers for handling long-term dependencies in the input signal. We show that our hybrid method achieves superior performance when compared to other state-of-the-art methods, with a PD detection accuracy of 97% and a severity staging accuracy of 87%. Our source code is available at: https://github.com/SafwenNaimi
    Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias. (arXiv:2310.14814v2 [cs.LG] UPDATED)
    Self-training is a well-known approach for semi-supervised learning. It consists of iteratively assigning pseudo-labels to unlabeled data for which the model is confident and treating them as labeled examples. For neural networks, softmax prediction probabilities are often used as a confidence measure, despite the fact that they are known to be overconfident, even for wrong predictions. This phenomenon is particularly intensified in the presence of sample selection bias, i.e., when data labeling is subject to some constraint. To address this issue, we propose a novel confidence measure, called $\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of linear classifiers. We provide the theoretical analysis of our approach by studying stationary points and describing the relationship between the diversity of the individual members and their performance. We empirically demonstrate the benefit of our confidence measure for three different pseudo-labeling policies on classification datasets of various data modalities.
    Joint Entity and Relation Extraction with Span Pruning and Hypergraph Neural Networks. (arXiv:2310.17238v1 [cs.CL])
    Entity and Relation Extraction (ERE) is an important task in information extraction. Recent marker-based pipeline models achieve state-of-the-art performance, but still suffer from the error propagation issue. Also, most of current ERE models do not take into account higher-order interactions between multiple entities and relations, while higher-order modeling could be beneficial.In this work, we propose HyperGraph neural network for ERE ($\hgnn{}$), which is built upon the PL-marker (a state-of-the-art marker-based pipleline model). To alleviate error propagation,we use a high-recall pruner mechanism to transfer the burden of entity identification and labeling from the NER module to the joint module of our model. For higher-order modeling, we build a hypergraph, where nodes are entities (provided by the span pruner) and relations thereof, and hyperedges encode interactions between two different relations or between a relation and its associated subject and object entities. We then run a hypergraph neural network for higher-order inference by applying message passing over the built hypergraph. Experiments on three widely used benchmarks (\acef{}, \ace{} and \scierc{}) for ERE task show significant improvements over the previous state-of-the-art PL-marker.
    Deep machine learning for meteor monitoring: advances with transfer learning and gradient-weighted class activation mapping. (arXiv:2310.16826v2 [astro-ph.EP] UPDATED)
    In recent decades, the use of optical detection systems for meteor studies has increased dramatically, resulting in huge amounts of data being analyzed. Automated meteor detection tools are essential for studying the continuous meteoroid incoming flux, recovering fresh meteorites, and achieving a better understanding of our Solar System. Concerning meteor detection, distinguishing false positives between meteor and non-meteor images has traditionally been performed by hand, which is significantly time-consuming. To address this issue, we developed a fully automated pipeline that uses Convolutional Neural Networks (CNNs) to classify candidate meteor detections. Our new method is able to detect meteors even in images that contain static elements such as clouds, the Moon, and buildings. To accurately locate the meteor within each frame, we employ the Gradient-weighted Class Activation Mapping (Grad-CAM) technique. This method facilitates the identification of the region of interest by multiplying the activations from the last convolutional layer with the average of the gradients across the feature map of that layer. By combining these findings with the activation map derived from the first convolutional layer, we effectively pinpoint the most probable pixel location of the meteor. We trained and evaluated our model on a large dataset collected by the Spanish Meteor Network (SPMN) and achieved a precision of 98\%. Our new methodology presented here has the potential to reduce the workload of meteor scientists and station operators and improve the accuracy of meteor tracking and classification.
    CEIL: Generalized Contextual Imitation Learning. (arXiv:2306.14534v2 [cs.LG] UPDATED)
    In this paper, we present \textbf{C}ont\textbf{E}xtual \textbf{I}mitation \textbf{L}earning~(CEIL), a general and broadly applicable algorithm for imitation learning (IL). Inspired by the formulation of hindsight information matching, we derive CEIL by explicitly learning a hindsight embedding function together with a contextual policy using the hindsight embeddings. To achieve the expert matching objective for IL, we advocate for optimizing a contextual variable such that it biases the contextual policy towards mimicking expert behaviors. Beyond the typical learning from demonstrations (LfD) setting, CEIL is a generalist that can be effectively applied to multiple settings including: 1)~learning from observations (LfO), 2)~offline IL, 3)~cross-domain IL (mismatched experts), and 4) one-shot IL settings. Empirically, we evaluate CEIL on the popular MuJoCo tasks (online) and the D4RL dataset (offline). Compared to prior state-of-the-art baselines, we show that CEIL is more sample-efficient in most online IL tasks and achieves better or competitive performances in offline tasks.
    Detecting and Mitigating Hallucinations in Multilingual Summarisation. (arXiv:2305.13632v2 [cs.CL] UPDATED)
    Hallucinations pose a significant challenge to the reliability of neural models for abstractive summarisation. While automatically generated summaries may be fluent, they often lack faithfulness to the original document. This issue becomes even more pronounced in low-resource settings, such as cross-lingual transfer. With the existing faithful metrics focusing on English, even measuring the extent of this phenomenon in cross-lingual settings is hard. To address this, we first develop a novel metric, mFACT, evaluating the faithfulness of non-English summaries, leveraging translation-based transfer from multiple English faithfulness metrics. We then propose a simple but effective method to reduce hallucinations with a cross-lingual transfer, which weighs the loss of each training example by its faithfulness score. Through extensive experiments in multiple languages, we demonstrate that mFACT is the metric that is most suited to detect hallucinations. Moreover, we find that our proposed loss weighting method drastically increases both performance and faithfulness according to both automatic and human evaluation when compared to strong baselines for cross-lingual transfer such as MAD-X. Our code and dataset are available at https://github.com/yfqiu-nlp/mfact-summ.
    Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations. (arXiv:2306.04618v2 [cs.CL] UPDATED)
    This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP. We find that the distribution shift settings in previous studies commonly lack adequate challenges, hindering the accurate evaluation of OOD robustness. To address these issues, we propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we conduct a series of experiments on pre-trained language models for analysis and evaluation of OOD robustness. First, for vanilla fine-tuning, we examine the relationship between in-distribution (ID) and OOD performance. We identify three typical types that unveil the inner learning mechanism, which could potentially facilitate the forecasting of OOD robustness, correlating with the advancements on ID datasets. Then, we evaluate 5 classic methods on BOSS and find that, despite exhibiting some effectiveness in specific cases, they do not offer significant improvement compared to vanilla fine-tuning. Further, we evaluate 5 LLMs with various adaptation paradigms and find that when sufficient ID data is available, fine-tuning domain-specific models outperform LLMs on ID examples significantly. However, in the case of OOD instances, prioritizing LLMs with in-context learning yields better results. We identify that both fine-tuned small models and LLMs face challenges in effectively addressing downstream tasks. The code is public at \url{https://github.com/lifan-yuan/OOD_NLP}.
    Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory. (arXiv:2307.04204v2 [cs.LG] UPDATED)
    Cohen et al. (2021) empirically study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent (GD) trajectory and observe the Edge of Stability (EoS) phenomenon. The sharpness increases at the early phase of training (referred to as progressive sharpening), and eventually saturates close to the threshold of $2 / \text{(step size)}$. In this paper, we start by demonstrating through empirical studies that when the EoS phenomenon occurs, different GD trajectories (after a proper reparameterization) align on a specific bifurcation diagram independent of initialization. We then rigorously prove this trajectory alignment phenomenon for a two-layer fully-connected linear network and a single-neuron nonlinear network trained with a single data point. Our trajectory alignment analysis establishes both progressive sharpening and EoS phenomena, encompassing and extending recent findings in the literature.
    Distribution-Free Model-Agnostic Regression Calibration via Nonparametric Methods. (arXiv:2305.12283v2 [cs.LG] UPDATED)
    In this paper, we consider the uncertainty quantification problem for regression models. Specifically, we consider an individual calibration objective for characterizing the quantiles of the prediction model. While such an objective is well-motivated from downstream tasks such as newsvendor cost, the existing methods have been largely heuristic and lack of statistical guarantee in terms of individual calibration. We show via simple examples that the existing methods focusing on population-level calibration guarantees such as average calibration or sharpness can lead to harmful and unexpected results. We propose simple nonparametric calibration methods that are agnostic of the underlying prediction model and enjoy both computational efficiency and statistical consistency. Our approach enables a better understanding of the possibility of individual calibration, and we establish matching upper and lower bounds for the calibration error of our proposed methods. Technically, our analysis combines the nonparametric analysis with a covering number argument for parametric analysis, which advances the existing theoretical analyses in the literature of nonparametric density estimation and quantile bandit problems. Importantly, the nonparametric perspective sheds new theoretical insights into regression calibration in terms of the curse of dimensionality and reconciles the existing results on the impossibility of individual calibration. To our knowledge, we make the first effort to reach both individual calibration and finite-sample guarantee with minimal assumptions in terms of conformal prediction. Numerical experiments show the advantage of such a simple approach under various metrics, and also under covariates shift. We hope our work provides a simple benchmark and a starting point of theoretical ground for future research on regression calibration.
    Neural (Tangent Kernel) Collapse. (arXiv:2305.16427v2 [cs.LG] UPDATED)
    This work bridges two important concepts: the Neural Tangent Kernel (NTK), which captures the evolution of deep neural networks (DNNs) during training, and the Neural Collapse (NC) phenomenon, which refers to the emergence of symmetry and structure in the last-layer features of well-trained classification DNNs. We adopt the natural assumption that the empirical NTK develops a block structure aligned with the class labels, i.e., samples within the same class have stronger correlations than samples from different classes. Under this assumption, we derive the dynamics of DNNs trained with mean squared (MSE) loss and break them into interpretable phases. Moreover, we identify an invariant that captures the essence of the dynamics, and use it to prove the emergence of NC in DNNs with block-structured NTK. We provide large-scale numerical experiments on three common DNN architectures and three benchmark datasets to support our theory.
    A Batch-to-Online Transformation under Random-Order Model. (arXiv:2306.07163v2 [cs.LG] UPDATED)
    We introduce a transformation framework that can be utilized to develop online algorithms with low $\epsilon$-approximate regret in the random-order model from offline approximation algorithms. We first give a general reduction theorem that transforms an offline approximation algorithm with low average sensitivity to an online algorithm with low $\epsilon$-approximate regret. We then demonstrate that offline approximation algorithms can be transformed into a low-sensitivity version using a coreset construction method. To showcase the versatility of our approach, we apply it to various problems, including online $(k,z)$-clustering, online matrix approximation, and online regression, and successfully achieve polylogarithmic $\epsilon$-approximate regret for each problem. Moreover, we show that in all three cases, our algorithm also enjoys low inconsistency, which may be desired in some online applications.
    A Comprehensive Study of Groundbreaking Machine Learning Research: Analyzing highly cited and impactful publications across six decades. (arXiv:2308.00855v2 [cs.DL] UPDATED)
    Machine learning (ML) has emerged as a prominent field of research in computer science and other related fields, thereby driving advancements in other domains of interest. As the field continues to evolve, it is crucial to understand the landscape of highly cited publications to identify key trends, influential authors, and significant contributions made thus far. In this paper, we present a comprehensive bibliometric analysis of highly cited ML publications. We collected a dataset consisting of the top-cited papers from reputable ML conferences and journals, covering a period of several years from 1959 to 2022. We employed various bibliometric techniques to analyze the data, including citation analysis, co-authorship analysis, keyword analysis, and publication trends. Our findings reveal the most influential papers, highly cited authors, and collaborative networks within the machine learning community. We identify popular research themes and uncover emerging topics that have recently gained significant attention. Furthermore, we examine the geographical distribution of highly cited publications, highlighting the dominance of certain countries in ML research. By shedding light on the landscape of highly cited ML publications, our study provides valuable insights for researchers, policymakers, and practitioners seeking to understand the key developments and trends in this rapidly evolving field.
    On Performance Discrepancies Across Local Homophily Levels in Graph Neural Networks. (arXiv:2306.05557v3 [cs.SI] UPDATED)
    Graph Neural Network (GNN) research has highlighted a relationship between high homophily (i.e., the tendency of nodes of the same class to connect) and strong predictive performance in node classification. However, recent work has found the relationship to be more nuanced, demonstrating that simple GNNs can learn in certain heterophilous settings. To resolve these conflicting findings and align closer to real-world datasets, we go beyond the assumption of a global graph homophily level and study the performance of GNNs when the local homophily level of a node deviates from the global homophily level. Through theoretical and empirical analysis, we systematically demonstrate how shifts in local homophily can introduce performance degradation, leading to performance discrepancies across local homophily levels. We ground the practical implications of this work through granular analysis on five real-world datasets with varying global homophily levels, demonstrating that (a) GNNs can fail to generalize to test nodes that deviate from the global homophily of a graph, and (b) high local homophily does not necessarily confer high performance for a node. We further show that GNNs designed for globally heterophilous graphs can alleviate performance discrepancy by improving performance across local homophily levels, offering a new perspective on how these GNNs achieve stronger global performance.
    Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis. (arXiv:2306.08645v2 [cs.CV] UPDATED)
    Diffusion models (DMs) have recently gained attention with state-of-the-art performance in text-to-image synthesis. Abiding by the tradition in deep learning, DMs are trained and evaluated on the images with fixed sizes. However, users are demanding for various images with specific sizes and various aspect ratio. This paper focuses on adapting text-to-image diffusion models to handle such variety while maintaining visual fidelity. First we observe that, during the synthesis, lower resolution images suffer from incomplete object portrayal, while higher resolution images exhibit repetitively disordered presentation. Next, we establish a statistical relationship indicating that attention entropy changes with token quantity, suggesting that models aggregate spatial information in proportion to image resolution. The subsequent interpretation on our observations is that objects are incompletely depicted due to limited spatial information for low resolutions, while repetitively disorganized presentation arises from redundant spatial information for high resolutions. From this perspective, we propose a scaling factor to alleviate the change of attention entropy and mitigate the defective pattern observed. Extensive experimental results validate the efficacy of the proposed scaling factor, enabling models to achieve better visual effects, image quality, and text alignment. Notably, these improvements are achieved without additional training or fine-tuning techniques.
    Improving Multimodal Datasets with Image Captioning. (arXiv:2307.10350v2 [cs.LG] UPDATED)
    Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies how generated captions can increase the utility of web-scraped datapoints with nondescript text. Through exploring different mixing strategies for raw and generated captions, we outperform the best filtering method proposed by the DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs. Our best approach is also 2x better at Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an effective source of text supervision. In experimenting with different image captioning models, we also demonstrate that the performance of a model on standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable indicator of the utility of the captions it generates for multimodal training. Finally, our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text, as well as the importance of image curation with increasing training data quantity. The synthetic captions used in our experiments are now available on HuggingFace.
    Small Total-Cost Constraints in Contextual Bandits with Knapsacks, with Application to Fairness. (arXiv:2305.15807v2 [stat.ML] UPDATED)
    We consider contextual bandit problems with knapsacks [CBwK], a problem where at each round, a scalar reward is obtained and vector-valued costs are suffered. The learner aims to maximize the cumulative rewards while ensuring that the cumulative costs are lower than some predetermined cost constraints. We assume that contexts come from a continuous set, that costs can be signed, and that the expected reward and cost functions, while unknown, may be uniformly estimated -- a typical assumption in the literature. In this setting, total cost constraints had so far to be at least of order $T^{3/4}$, where $T$ is the number of rounds, and were even typically assumed to depend linearly on $T$. We are however motivated to use CBwK to impose a fairness constraint of equalized average costs between groups: the budget associated with the corresponding cost constraints should be as close as possible to the natural deviations, of order $\sqrt{T}$. To that end, we introduce a dual strategy based on projected-gradient-descent updates, that is able to deal with total-cost constraints of the order of $\sqrt{T}$ up to poly-logarithmic terms. This strategy is more direct and simpler than existing strategies in the literature. It relies on a careful, adaptive, tuning of the step size.
    DSAC-T: Distributional Soft Actor-Critic with Three Refinements. (arXiv:2310.05858v3 [cs.LG] UPDATED)
    Reinforcement learning (RL) has proven to be highly effective in tackling complex decision-making and control tasks. However, prevalent model-free RL methods often face severe performance degradation due to the well-known overestimation issue. In response to this problem, we recently introduced an off-policy RL algorithm, called distributional soft actor-critic (DSAC or DSAC-v1), which can effectively improve the value estimation accuracy by learning a continuous Gaussian value distribution. Nonetheless, standard DSAC has its own shortcomings, including occasionally unstable learning processes and needs for task-specific reward scaling, which may hinder its overall performance and adaptability in some special tasks. This paper further introduces three important refinements to standard DSAC in order to address these shortcomings. These refinements consist of critic gradient adjusting, twin value distribution learning, and variance-based target return clipping. The modified RL algorithm is named as DSAC with three refinements (DSAC-T or DSAC-v2), and its performances are systematically evaluated on a diverse set of benchmark tasks. Without any task-specific hyperparameter tuning, DSAC-T surpasses a lot of mainstream model-free RL algorithms, including SAC, TD3, DDPG, TRPO, and PPO, in all tested environments. Additionally, DSAC-T, unlike its standard version, ensures a highly stable learning process and delivers similar performance across varying reward scales.
    Driving through the Concept Gridlock: Unraveling Explainability Bottlenecks in Automated Driving. (arXiv:2310.16639v2 [cs.CV] UPDATED)
    Concept bottleneck models have been successfully used for explainable machine learning by encoding information within the model with a set of human-defined concepts. In the context of human-assisted or autonomous driving, explainability models can help user acceptance and understanding of decisions made by the autonomous vehicle, which can be used to rationalize and explain driver or vehicle behavior. We propose a new approach using concept bottlenecks as visual features for control command predictions and explanations of user and vehicle behavior. We learn a human-understandable concept layer that we use to explain sequential driving scenes while learning vehicle control commands. This approach can then be used to determine whether a change in a preferred gap or steering commands from a human (or autonomous vehicle) is led by an external stimulus or change in preferences. We achieve competitive performance to latent visual features while gaining interpretability within our model setup.
    Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension. (arXiv:2305.14077v2 [stat.ML] UPDATED)
    The success of over-parameterized neural networks trained to near-zero training error has caused great interest in the phenomenon of benign overfitting, where estimators are statistically consistent even though they interpolate noisy training data. While benign overfitting in fixed dimension has been established for some learning methods, current literature suggests that for regression with typical kernel methods and wide neural networks, benign overfitting requires a high-dimensional setting where the dimension grows with the sample size. In this paper, we show that the smoothness of the estimators, and not the dimension, is the key: benign overfitting is possible if and only if the estimator's derivatives are large enough. We generalize existing inconsistency results to non-interpolating models and more kernels to show that benign overfitting with moderate derivatives is impossible in fixed dimension. Conversely, we show that rate-optimal benign overfitting is possible for regression with a sequence of spiky-smooth kernels with large derivatives. Using neural tangent kernels, we translate our results to wide neural networks. We prove that while infinite-width networks do not overfit benignly with the ReLU activation, this can be fixed by adding small high-frequency fluctuations to the activation function. Our experiments verify that such neural networks, while overfitting, can indeed generalize well even on low-dimensional data sets.
    SEEDS: Exponential SDE Solvers for Fast High-Quality Sampling from Diffusion Models. (arXiv:2305.14267v2 [cs.LG] UPDATED)
    A potent class of generative models known as Diffusion Probabilistic Models (DPMs) has become prominent. A forward diffusion process adds gradually noise to data, while a model learns to gradually denoise. Sampling from pre-trained DPMs is obtained by solving differential equations (DE) defined by the learnt model, a process which has shown to be prohibitively slow. Numerous efforts on speeding-up this process have consisted on crafting powerful ODE solvers. Despite being quick, such solvers do not usually reach the optimal quality achieved by available slow SDE solvers. Our goal is to propose SDE solvers that reach optimal quality without requiring several hundreds or thousands of NFEs to achieve that goal. We propose Stochastic Explicit Exponential Derivative-free Solvers (SEEDS), improving and generalizing Exponential Integrator approaches to the stochastic case on several frameworks. After carefully analyzing the formulation of exact solutions of diffusion SDEs, we craft SEEDS to analytically compute the linear part of such solutions. Inspired by the Exponential Time-Differencing method, SEEDS use a novel treatment of the stochastic components of solutions, enabling the analytical computation of their variance, and contains high-order terms allowing to reach optimal quality sampling $\sim3$-$5\times$ faster than previous SDE methods. We validate our approach on several image generation benchmarks, showing that SEEDS outperform or are competitive with previous SDE solvers. Contrary to the latter, SEEDS are derivative and training free, and we fully prove strong convergence guarantees for them.
    Gaussian Membership Inference Privacy. (arXiv:2306.07273v2 [cs.LG] UPDATED)
    We propose a novel and practical privacy notion called $f$-Membership Inference Privacy ($f$-MIP), which explicitly considers the capabilities of realistic adversaries under the membership inference attack threat model. Consequently, $f$-MIP offers interpretable privacy guarantees and improved utility (e.g., better classification accuracy). In particular, we derive a parametric family of $f$-MIP guarantees that we refer to as $\mu$-Gaussian Membership Inference Privacy ($\mu$-GMIP) by theoretically analyzing likelihood ratio-based membership inference attacks on stochastic gradient descent (SGD). Our analysis highlights that models trained with standard SGD already offer an elementary level of MIP. Additionally, we show how $f$-MIP can be amplified by adding noise to gradient updates. Our analysis further yields an analytical membership inference attack that offers two distinct advantages over previous approaches. First, unlike existing state-of-the-art attacks that require training hundreds of shadow models, our attack does not require any shadow model. Second, our analytical attack enables straightforward auditing of our privacy notion $f$-MIP. Finally, we quantify how various hyperparameters (e.g., batch size, number of model parameters) and specific data characteristics determine an attacker's ability to accurately infer a point's membership in the training set. We demonstrate the effectiveness of our method on models trained on vision and tabular datasets.
    Statistical Component Separation for Targeted Signal Recovery in Noisy Mixtures. (arXiv:2306.15012v2 [stat.ML] UPDATED)
    Separating signals from an additive mixture may be an unnecessarily hard problem when one is only interested in specific properties of a given signal. In this work, we tackle simpler "statistical component separation" problems that focus on recovering a predefined set of statistical descriptors of a target signal from a noisy mixture. Assuming access to samples of the noise process, we investigate a method devised to match the statistics of the solution candidate corrupted by noise samples with those of the observed mixture. We first analyze the behavior of this method using simple examples with analytically tractable calculations. Then, we apply it in an image denoising context employing 1) wavelet-based descriptors, 2) ConvNet-based descriptors on astrophysics and ImageNet data. In the case of 1), we show that our method better recovers the descriptors of the target data than a standard denoising method in most situations. Additionally, despite not constructed for this purpose, it performs surprisingly well in terms of peak signal-to-noise ratio on full signal reconstruction. In comparison, representation 2) appears less suitable for image denoising. Finally, we extend this method by introducing a diffusive stepwise algorithm which gives a new perspective to the initial method and leads to promising results for image denoising under specific circumstances.
    Adaptive important sampling for Deep Ritz. (arXiv:2310.17185v1 [cs.LG])
    We introduce an adaptive sampling method for the Deep Ritz method aimed at solving partial differential equations (PDEs). Two deep neural networks are used. One network is employed to approximate the solution of PDEs, while the other one is a deep generative model used to generate new collocation points to refine the training set. The adaptive sampling procedure consists of two main steps. The first step is solving the PDEs using the Deep Ritz method by minimizing an associated variational loss discretized by the collocation points in the training set. The second step involves generating a new training set, which is then used in subsequent computations to further improve the accuracy of the current approximate solution. We treat the integrand in the variational loss as an unnormalized probability density function (PDF) and approximate it using a deep generative model called bounded KRnet. The new samples and their associated PDF values are obtained from the bounded KRnet. With these new samples and their associated PDF values, the variational loss can be approximated more accurately by importance sampling. Compared to the original Deep Ritz method, the proposed adaptive method improves accuracy, especially for problems characterized by low regularity and high dimensionality. We demonstrate the effectiveness of our new method through a series of numerical experiments.
    The Wasserstein Believer: Learning Belief Updates for Partially Observable Environments through Reliable Latent Space Models. (arXiv:2303.03284v3 [cs.LG] UPDATED)
    Partially Observable Markov Decision Processes (POMDPs) are used to model environments where the full state cannot be perceived by an agent. As such the agent needs to reason taking into account the past observations and actions. However, simply remembering the full history is generally intractable due to the exponential growth in the history space. Maintaining a probability distribution that models the belief over what the true state is can be used as a sufficient statistic of the history, but its computation requires access to the model of the environment and is often intractable. While SOTA algorithms use Recurrent Neural Networks to compress the observation-action history aiming to learn a sufficient statistic, they lack guarantees of success and can lead to sub-optimal policies. To overcome this, we propose the Wasserstein Belief Updater, an RL algorithm that learns a latent model of the POMDP and an approximation of the belief update. Our approach comes with theoretical guarantees on the quality of our approximation ensuring that our outputted beliefs allow for learning the optimal value function.
    RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers via Randomized Deletion. (arXiv:2302.01757v2 [cs.CR] UPDATED)
    Randomized smoothing is a leading approach for constructing classifiers that are certifiably robust against adversarial examples. Existing work on randomized smoothing has focused on classifiers with continuous inputs, such as images, where $\ell_p$-norm bounded adversaries are commonly studied. However, there has been limited work for classifiers with discrete or variable-size inputs, such as for source code, which require different threat models and smoothing mechanisms. In this work, we adapt randomized smoothing for discrete sequence classifiers to provide certified robustness against edit distance-bounded adversaries. Our proposed smoothing mechanism randomized deletion (RS-Del) applies random deletion edits, which are (perhaps surprisingly) sufficient to confer robustness against adversarial deletion, insertion and substitution edits. Our proof of certification deviates from the established Neyman-Pearson approach, which is intractable in our setting, and is instead organized around longest common subsequences. We present a case study on malware detection--a binary classification problem on byte sequences where classifier evasion is a well-established threat model. When applied to the popular MalConv malware detection model, our smoothing mechanism RS-Del achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.
    Investigating Topological Order using Recurrent Neural Networks. (arXiv:2303.11207v3 [cond-mat.str-el] UPDATED)
    Recurrent neural networks (RNNs), originally developed for natural language processing, hold great promise for accurately describing strongly correlated quantum many-body systems. Here, we employ 2D RNNs to investigate two prototypical quantum many-body Hamiltonians exhibiting topological order. Specifically, we demonstrate that RNN wave functions can effectively capture the topological order of the toric code and a Bose-Hubbard spin liquid on the kagome lattice by estimating their topological entanglement entropies. We also find that RNNs favor coherent superpositions of minimally-entangled states over minimally-entangled states themselves. Overall, our findings demonstrate that RNN wave functions constitute a powerful tool to study phases of matter beyond Landau's symmetry-breaking paradigm.
    What Makes Data Suitable for a Locally Connected Neural Network? A Necessary and Sufficient Condition Based on Quantum Entanglement. (arXiv:2303.11249v4 [cs.LG] UPDATED)
    The question of what makes a data distribution suitable for deep learning is a fundamental open problem. Focusing on locally connected neural networks (a prevalent family of architectures that includes convolutional and recurrent neural networks as well as local self-attention models), we address this problem by adopting theoretical tools from quantum physics. Our main theoretical result states that a certain locally connected neural network is capable of accurate prediction over a data distribution if and only if the data distribution admits low quantum entanglement under certain canonical partitions of features. As a practical application of this result, we derive a preprocessing method for enhancing the suitability of a data distribution to locally connected neural networks. Experiments with widespread models over various datasets demonstrate our findings. We hope that our use of quantum entanglement will encourage further adoption of tools from physics for formally reasoning about the relation between deep learning and real-world data.
    Self-Evaluation Guided Beam Search for Reasoning. (arXiv:2305.00633v3 [cs.CL] UPDATED)
    Breaking down a problem into intermediate steps has demonstrated impressive performance in Large Language Model (LLM) reasoning. However, the growth of the reasoning chain introduces uncertainty and error accumulation, making it challenging to elicit accurate final results. To tackle this challenge of uncertainty in multi-step reasoning, we introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of LLMs. We propose a decoding algorithm integrating the self-evaluation guidance via stochastic beam search. The self-evaluation guidance serves as a better-calibrated automatic criterion, facilitating an efficient search in the reasoning space and resulting in superior prediction quality. Stochastic beam search balances exploitation and exploration of the search space with temperature-controlled randomness. Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34\%$, $9.56\%$, and $5.46\%$ on the GSM8K, AQuA, and StrategyQA benchmarks, respectively. Experiment results with Llama-2 on arithmetic reasoning demonstrate the efficiency of our method in outperforming the baseline methods with comparable computational budgets. Further analysis in multi-step reasoning finds our self-evaluation guidance pinpoints logic failures and leads to higher consistency and robustness. Our code is publicly available at https://guideddecoding.github.io/.
    Improving the Timing Resolution of Positron Emission Tomography Detectors Using Boosted Learning -- A Residual Physics Approach. (arXiv:2302.01681v2 [cs.LG] UPDATED)
    Artificial intelligence (AI) is entering medical imaging, mainly enhancing image reconstruction. Nevertheless, improvements throughout the entire processing, from signal detection to computation, potentially offer significant benefits. This work presents a novel and versatile approach to detector optimization using machine learning (ML) and residual physics. We apply the concept to positron emission tomography (PET), intending to improve the coincidence time resolution (CTR). PET visualizes metabolic processes in the body by detecting photons with scintillation detectors. Improved CTR performance offers the advantage of reducing radioactive dose exposure for patients. Modern PET detectors with sophisticated concepts and read-out topologies represent complex physical and electronic systems requiring dedicated calibration techniques. Traditional methods primarily depend on analytical formulations successfully describing the main detector characteristics. However, when accounting for higher-order effects, additional complexities arise matching theoretical models to experimental reality. Our work addresses this challenge by combining traditional calibration with AI and residual physics, presenting a highly promising approach. We present a residual physics-based strategy using gradient tree boosting and physics-guided data generation. The explainable AI framework SHapley Additive exPlanations (SHAP) was used to identify known physical effects with learned patterns. In addition, the models were tested against basic physical laws. We were able to improve the CTR significantly (more than 20%) for clinically relevant detectors of 19 mm height, reaching CTRs of 185 ps (450-550 keV).
    Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation. (arXiv:2305.11685v2 [eess.AS] UPDATED)
    Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks. However, huge number of parameters in speech SSL models necessitate the compression to a more compact model for wider usage in academia or small companies. In this study, we suggest to reuse attention maps across the Transformer layers, so as to remove key and query parameters while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the student model's speech representation quality. We extend the distillation loss to utilize both masked and unmasked speech frames to fully leverage the teacher model's high-quality representation. Our universal compression strategy yields the student model that achieves phoneme error rate (PER) of 7.72% and word error rate (WER) of 9.96% on the SUPERB benchmark.
    Large-Scale Gaussian Processes via Alternating Projection. (arXiv:2310.17137v1 [cs.LG])
    Gaussian process (GP) hyperparameter optimization requires repeatedly solving linear systems with $n \times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative numerical methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the corresponding kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling \emph{mini-batching}. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove our method enjoys linear convergence and empirically we demonstrate its robustness to ill-conditioning. On large-scale benchmark datasets up to four million datapoints our approach accelerates training by a factor of 2$\times$ to 27$\times$ compared to CG.
    Understanding and Addressing the Pitfalls of Bisimulation-based Representations in Offline Reinforcement Learning. (arXiv:2310.17139v1 [cs.LG])
    While bisimulation-based approaches hold promise for learning robust state representations for Reinforcement Learning (RL) tasks, their efficacy in offline RL tasks has not been up to par. In some instances, their performance has even significantly underperformed alternative methods. We aim to understand why bisimulation methods succeed in online settings, but falter in offline tasks. Our analysis reveals that missing transitions in the dataset are particularly harmful to the bisimulation principle, leading to ineffective estimation. We also shed light on the critical role of reward scaling in bounding the scale of bisimulation measurements and of the value error they induce. Based on these findings, we propose to apply the expectile operator for representation learning to our offline RL setting, which helps to prevent overfitting to incomplete data. Meanwhile, by introducing an appropriate reward scaling strategy, we avoid the risk of feature collapse in representation space. We implement these recommendations on two state-of-the-art bisimulation-based algorithms, MICo and SimSR, and demonstrate performance gains on two benchmark suites: D4RL and Visual D4RL. Codes are provided at \url{https://github.com/zanghyu/Offline_Bisimulation}.
    Beyond MLE: Convex Learning for Text Generation. (arXiv:2310.17217v1 [cs.CL])
    Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution that best explain the observed data. In the context of text generation, MLE is often used to train generative language models, which can then be used to generate new text. However, we argue that MLE is not always necessary and optimal, especially for closed-ended text generation tasks like machine translation. In these tasks, the goal of model is to generate the most appropriate response, which does not necessarily require it to estimate the entire data distribution with MLE. To this end, we propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution. We investigate the theoretical properties of the optimal predicted distribution when applying convex functions to the loss, demonstrating that convex functions can sharpen the optimal distribution, thereby enabling the model to better capture outputs with high probabilities. Experiments on various text generation tasks and models show the effectiveness of our approach. It enables autoregressive models to bridge the gap between greedy and beam search, and facilitates the learning of non-autoregressive models with a maximum improvement of 9+ BLEU points. Moreover, our approach also exhibits significant impact on large language models (LLMs), substantially enhancing their generative capability on various tasks. Source code is available at \url{https://github.com/ictnlp/Convex-Learning}.
    Fairness and bias correction in machine learning for depression prediction: results from four study populations. (arXiv:2211.05321v3 [cs.LG] UPDATED)
    A significant level of stigma and inequality exists in mental healthcare, especially in under-served populations. Inequalities are reflected in the data collected for scientific purposes. When not properly accounted for, machine learning (ML) models leart from data can reinforce these structural inequalities or biases. Here, we present a systematic study of bias in ML models designed to predict depression in four different case studies covering different countries and populations. We find that standard ML approaches show regularly biased behaviors. We also show that mitigation techniques, both standard and our own post-hoc method, can be effective in reducing the level of unfair bias. No single best ML model for depression prediction provides equality of outcomes. This emphasizes the importance of analyzing fairness during model selection and transparent reporting about the impact of debiasing interventions. Finally, we provide practical recommendations to develop bias-aware ML models for depression risk prediction.
    Learning Rate Free Bayesian Inference in Constrained Domains. (arXiv:2305.14943v2 [stat.ML] UPDATED)
    We introduce a suite of new particle-based algorithms for sampling on constrained domains which are entirely learning rate free. Our approach leverages coin betting ideas from convex optimisation, and the viewpoint of constrained sampling as a mirrored optimisation problem on the space of probability measures. Based on this viewpoint, we also introduce a unifying framework for several existing constrained sampling algorithms, including mirrored Langevin dynamics and mirrored Stein variational gradient descent. We demonstrate the performance of our algorithms on a range of numerical examples, including sampling from targets on the simplex, sampling with fairness constraints, and constrained sampling problems in post-selection inference. Our results indicate that our algorithms achieve competitive performance with existing constrained sampling methods, without the need to tune any hyperparameters.
    Simulation based stacking. (arXiv:2310.17009v1 [stat.ME])
    Simulation-based inference has been popular for amortized Bayesian computation. It is typical to have more than one posterior approximation, from different inference algorithms, different architectures, or simply the randomness of initialization and stochastic gradients. With a provable asymptotic guarantee, we present a general stacking framework to make use of all available posterior approximations. Our stacking method is able to combine densities, simulation draws, confidence intervals, and moments, and address the overall precision, calibration, coverage, and bias at the same time. We illustrate our method on several benchmark simulations and a challenging cosmological inference task.
    Automating lichen monitoring in ecological studies using instance segmentation of time-lapse images. (arXiv:2310.17080v1 [cs.CV])
    Lichens are symbiotic organisms composed of fungi, algae, and/or cyanobacteria that thrive in a variety of environments. They play important roles in carbon and nitrogen cycling, and contribute directly and indirectly to biodiversity. Ecologists typically monitor lichens by using them as indicators to assess air quality and habitat conditions. In particular, epiphytic lichens, which live on trees, are key markers of air quality and environmental health. A new method of monitoring epiphytic lichens involves using time-lapse cameras to gather images of lichen populations. These cameras are used by ecologists in Newfoundland and Labrador to subsequently analyze and manually segment the images to determine lichen thalli condition and change. These methods are time-consuming and susceptible to observer bias. In this work, we aim to automate the monitoring of lichens over extended periods and to estimate their biomass and condition to facilitate the task of ecologists. To accomplish this, our proposed framework uses semantic segmentation with an effective training approach to automate monitoring and biomass estimation of epiphytic lichens on time-lapse images. We show that our method has the potential to significantly improve the accuracy and efficiency of lichen population monitoring, making it a valuable tool for forest ecologists and environmental scientists to evaluate the impact of climate change on Canada's forests. To the best of our knowledge, this is the first time that such an approach has been used to assist ecologists in monitoring and analyzing epiphytic lichens.
    ZipLM: Inference-Aware Structured Pruning of Language Models. (arXiv:2302.04089v2 [cs.LG] UPDATED)
    The breakthrough performance of large language models (LLMs) comes with major computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a novel structured compression approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art accuracy-vs-speedup, while matching a set of desired target runtime speedups in any given inference environment. Specifically, given a model, a dataset, an inference environment, as well as a set of speedup targets, ZipLM iteratively identifies and removes components with the worst loss-runtime trade-off. Unlike prior methods that specialize in either the post-training/one-shot or the gradual compression setting, and only for specific families of models such as BERT (encoder) or GPT (decoder), ZipLM produces state-of-the-art compressed models across all these settings. Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications. In particular, ZipLM outperforms all prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and TinyBERT. Moreover, it matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large model. When compressing GPT2, ZipLM outperforms DistilGPT2 while being 60% smaller and 30% faster. Our code is available at: https://github.com/IST-DASLab/ZipLM.
    Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult. (arXiv:2310.17087v1 [cs.LG])
    Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases including the edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Though significant theoretical progress has been made in understanding these implicit biases, it remains unclear for which objective functions would they occur. This paper provides an initial step in answering this question, namely that these implicit biases are in fact various tips of the same iceberg. They occur when the objective function of optimization has some good regularity, which, in combination with a provable preference of large learning rate gradient descent for moving toward flatter regions, results in these nontrivial dynamical phenomena. To establish this result, we develop a new global convergence theory under large learning rates, for a family of nonconvex functions without globally Lipschitz continuous gradient, which was typically assumed in existing convergence analysis. A byproduct is the first non-asymptotic convergence rate bound for large-learning-rate gradient descent optimization of nonconvex functions. We also validate our theory with experiments on neural networks, where different losses, activation functions, and batch normalization all can significantly affect regularity and lead to very different training dynamics.
    Explainable Spatio-Temporal Graph Neural Networks. (arXiv:2310.17149v1 [cs.LG])
    Spatio-temporal graph neural networks (STGNNs) have gained popularity as a powerful tool for effectively modeling spatio-temporal dependencies in diverse real-world urban applications, including intelligent transportation and public safety. However, the black-box nature of STGNNs limits their interpretability, hindering their application in scenarios related to urban resource allocation and policy formulation. To bridge this gap, we propose an Explainable Spatio-Temporal Graph Neural Networks (STExplainer) framework that enhances STGNNs with inherent explainability, enabling them to provide accurate predictions and faithful explanations simultaneously. Our framework integrates a unified spatio-temporal graph attention network with a positional information fusion layer as the STG encoder and decoder, respectively. Furthermore, we propose a structure distillation approach based on the Graph Information Bottleneck (GIB) principle with an explainable objective, which is instantiated by the STG encoder and decoder. Through extensive experiments, we demonstrate that our STExplainer outperforms state-of-the-art baselines in terms of predictive accuracy and explainability metrics (i.e., sparsity and fidelity) on traffic and crime prediction tasks. Furthermore, our model exhibits superior representation ability in alleviating data missing and sparsity issues. The implementation code is available at: https://github.com/HKUDS/STExplainer.
    math-PVS: A Large Language Model Framework to Map Scientific Publications to PVS Theories. (arXiv:2310.17064v1 [cs.AI])
    As artificial intelligence (AI) gains greater adoption in a wide variety of applications, it has immense potential to contribute to mathematical discovery, by guiding conjecture generation, constructing counterexamples, assisting in formalizing mathematics, and discovering connections between different mathematical areas, to name a few. While prior work has leveraged computers for exhaustive mathematical proof search, recent efforts based on large language models (LLMs) aspire to position computing platforms as co-contributors in the mathematical research process. Despite their current limitations in logic and mathematical tasks, there is growing interest in melding theorem proving systems with foundation models. This work investigates the applicability of LLMs in formalizing advanced mathematical concepts and proposes a framework that can critically review and check mathematical reasoning in research papers. Given the noted reasoning shortcomings of LLMs, our approach synergizes the capabilities of proof assistants, specifically PVS, with LLMs, enabling a bridge between textual descriptions in academic papers and formal specifications in PVS. By harnessing the PVS environment, coupled with data ingestion and conversion mechanisms, we envision an automated process, called \emph{math-PVS}, to extract and formalize mathematical theorems from research papers, offering an innovative tool for academic review and discovery.
    LLM4DyG: Can Large Language Models Solve Problems on Dynamic Graphs?. (arXiv:2310.17110v1 [cs.LG])
    In an era marked by the increasing adoption of Large Language Models (LLMs) for various tasks, there is a growing focus on exploring LLMs' capabilities in handling web data, particularly graph data. Dynamic graphs, which capture temporal network evolution patterns, are ubiquitous in real-world web data. Evaluating LLMs' competence in understanding spatial-temporal information on dynamic graphs is essential for their adoption in web applications, which remains unexplored in the literature. In this paper, we bridge the gap via proposing to evaluate LLMs' spatial-temporal understanding abilities on dynamic graphs, to the best of our knowledge, for the first time. Specifically, we propose the LLM4DyG benchmark, which includes nine specially designed tasks considering the capability evaluation of LLMs from both temporal and spatial dimensions. Then, we conduct extensive experiments to analyze the impacts of different data generators, data statistics, prompting techniques, and LLMs on the model performance. Finally, we propose Disentangled Spatial-Temporal Thoughts (DST2) for LLMs on dynamic graphs to enhance LLMs' spatial-temporal understanding abilities. Our main observations are: 1) LLMs have preliminary spatial-temporal understanding abilities on dynamic graphs, 2) Dynamic graph tasks show increasing difficulties for LLMs as the graph size and density increase, while not sensitive to the time span and data generation mechanism, 3) the proposed DST2 prompting method can help to improve LLMs' spatial-temporal understanding abilities on dynamic graphs for most tasks. The data and codes will be open-sourced at publication time.
    Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration. (arXiv:2310.17153v1 [cs.LG])
    Semi-implicit variational inference (SIVI) has been introduced to expand the analytical variational families by defining expressive semi-implicit distributions in a hierarchical manner. However, the single-layer architecture commonly used in current SIVI methods can be insufficient when the target posterior has complicated structures. In this paper, we propose hierarchical semi-implicit variational inference, called HSIVI, which generalizes SIVI to allow more expressive multi-layer construction of semi-implicit distributions. By introducing auxiliary distributions that interpolate between a simple base distribution and the target distribution, the conditional layers can be trained by progressively matching these auxiliary distributions one layer after another. Moreover, given pre-trained score networks, HSIVI can be used to accelerate the sampling process of diffusion models with the score matching objective. We show that HSIVI significantly enhances the expressiveness of SIVI on several Bayesian inference problems with complicated target distributions. When used for diffusion model acceleration, we show that HSIVI can produce high quality samples comparable to or better than the existing fast diffusion model based samplers with a small number of function evaluations on various datasets.
    Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time. (arXiv:2310.17157v1 [cs.LG])
    Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, but existing methods either require costly retraining, have to forgo LLM's in-context learning ability, or do not yield wall-clock time speedup on modern hardware. We hypothesize that contextual sparsity, which are small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output as the dense model for a given input, can address these issues. We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising LLM's quality or in-context learning ability. Based on these insights, we propose DejaVu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference. We validate that DejaVu can reduce the inference latency of OPT-175B by over 2X compared to the state-of-the-art FasterTransformer, and over 6X compared to the widely used Hugging Face implementation, without compromising model quality. The code is available at https://github.com/FMInference/DejaVu.
    Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity. (arXiv:2310.17247v1 [cs.LG])
    In some settings neural networks exhibit a phenomenon known as grokking, where they achieve perfect or near-perfect accuracy on the validation set long after the same performance has been achieved on the training set. In this paper, we discover that grokking is not limited to neural networks but occurs in other settings such as Gaussian process (GP) classification, GP regression and linear regression. We also uncover a mechanism by which to induce grokking on algorithmic datasets via the addition of dimensions containing spurious information. The presence of the phenomenon in non-neural architectures provides evidence that grokking is not specific to SGD or weight norm regularisation. Instead, grokking may be possible in any setting where solution search is guided by complexity and error. Based on this insight and further trends we see in the training trajectories of a Bayesian neural network (BNN) and GP regression model, we make progress towards a more general theory of grokking. Specifically, we hypothesise that the phenomenon is governed by the accessibility of certain regions in the error and complexity landscapes.
    Bayesian Neural Networks for Geothermal Resource Assessment: Prediction with Uncertainty. (arXiv:2209.15543v3 [physics.geo-ph] UPDATED)
    We consider the application of machine learning to the evaluation of geothermal resource potential. A supervised learning problem is defined where maps of 10 geological and geophysical features within the state of Nevada, USA are used to define geothermal potential across a broad region. We have available a relatively small set of positive training sites (known resources or active power plants) and negative training sites (known drill sites with unsuitable geothermal conditions) and use these to constrain and optimize artificial neural networks for this classification task. The main objective is to predict the geothermal resource potential at unknown sites within a large geographic area where the defining features are known. These predictions could be used to target promising areas for further detailed investigations. We describe the evolution of our work from defining a specific neural network architecture to training and optimization trials. Upon analysis we expose the inevitable problems of model variability and resulting prediction uncertainty. Finally, to address these problems we apply the concept of Bayesian neural networks, a heuristic approach to regularization in network training, and make use of the practical interpretation of the formal uncertainty measures they provide.
    Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models. (arXiv:2310.17120v1 [cs.CL])
    Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.
    Codebook Features: Sparse and Discrete Interpretability for Neural Networks. (arXiv:2310.17230v1 [cs.LG])
    Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during generation to elicit that behavior. We validate our approach by training codebook Transformers on several different datasets. First, we explore a finite state machine dataset with far more hidden states than neurons. In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that we can make the neural network behave as if it is in a different state by activating the code for that state. Second, we train Transformer language models with up to 410M parameters on two natural language datasets. We identify codes in these models representing diverse, disentangled concepts (ranging from negative emotions to months of the year) and find that we can guide the model to generate different topics by activating the appropriate codes during inference. Overall, codebook features appear to be a promising unit of analysis and control for neural networks and interpretability. Our codebase and models are open-sourced at https://github.com/taufeeque9/codebook-features.
    MaxEnt Loss: Constrained Maximum Entropy for Calibration under Out-of-Distribution Shift. (arXiv:2310.17159v1 [cs.LG])
    We present a new loss function that addresses the out-of-distribution (OOD) calibration problem. While many objective functions have been proposed to effectively calibrate models in-distribution, our findings show that they do not always fare well OOD. Based on the Principle of Maximum Entropy, we incorporate helpful statistical constraints observed during training, delivering better model calibration without sacrificing accuracy. We provide theoretical analysis and show empirically that our method works well in practice, achieving state-of-the-art calibration on both synthetic and real-world benchmarks.
    Improving Neural Additive Models with Bayesian Principles. (arXiv:2305.16905v2 [stat.ML] UPDATED)
    Neural additive models (NAMs) can improve the interpretability of deep neural networks by handling input features in separate additive sub-networks. However, they lack inherent mechanisms that provide calibrated uncertainties and enable selection of relevant features and interactions. Approaching NAMs from a Bayesian perspective, we enhance them in three primary ways, namely by a) providing credible intervals for the individual additive sub-networks; b) estimating the marginal likelihood to perform an implicit selection of features via an empirical Bayes procedure; and c) enabling a ranking of feature pairs as candidates for second-order interaction in fine-tuned models. In particular, we develop Laplace-approximated NAMs (LA-NAMs), which show improved empirical performance on tabular datasets and challenging real-world medical tasks.
    Language-based Action Concept Spaces Improve Video Self-Supervised Learning. (arXiv:2307.10922v3 [cs.CV] UPDATED)
    Recent contrastive language image pre-training has led to learning highly transferable and robust image representations. However, adapting these models to video domains with minimal supervision remains an open problem. We explore a simple step in that direction, using language tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling is trained under self-distillation settings with train objectives operating in an action concept space. Feature vectors of various action concepts extracted from a language encoder using relevant textual prompts construct this space. We introduce two train objectives, concept distillation and concept alignment, that retain generality of original representations while enforcing relations between actions and their attributes. Our approach improves zero-shot and linear probing performance on three action recognition benchmarks.
    Seeing is not Believing: Robust Reinforcement Learning against Spurious Correlation. (arXiv:2307.07907v2 [cs.LG] UPDATED)
    Robustness has been extensively studied in reinforcement learning (RL) to handle various forms of uncertainty such as random perturbations, rare events, and malicious attacks. In this work, we consider one critical type of robustness against spurious correlation, where different portions of the state do not have correlations induced by unobserved confounders. These spurious correlations are ubiquitous in real-world tasks, for instance, a self-driving car usually observes heavy traffic in the daytime and light traffic at night due to unobservable human activity. A model that learns such useless or even harmful correlation could catastrophically fail when the confounder in the test case deviates from the training one. Although motivated, enabling robustness against spurious correlation poses significant challenges since the uncertainty set, shaped by the unobserved confounder and causal structure, is difficult to characterize and identify. Existing robust algorithms that assume simple and unstructured uncertainty sets are therefore inadequate to address this challenge. To solve this issue, we propose Robust State-Confounded Markov Decision Processes (RSC-MDPs) and theoretically demonstrate its superiority in avoiding learning spurious correlations compared with other robust RL counterparts. We also design an empirical algorithm to learn the robust optimal policy for RSC-MDPs, which outperforms all baselines in eight realistic self-driving and manipulation tasks.
    MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models. (arXiv:2310.02255v2 [cs.CV] UPDATED)
    Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.
    Look Beneath the Surface: Exploiting Fundamental Symmetry for Sample-Efficient Offline RL. (arXiv:2306.04220v4 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms heavily depends on the scale and state-action space coverage of datasets. Real-world data collection is often expensive and uncontrollable, leading to small and narrowly covered datasets and posing significant challenges for practical deployments of offline RL. In this paper, we provide a new insight that leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance under small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for OOD samples based on compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. Based on extensive experiments, we find TSRL achieves great performance on small benchmark datasets with as few as 1% of the original samples, which significantly outperforms the recent offline RL algorithms in terms of data efficiency and generalizability.Code is available at: https://github.com/pcheng2/TSRL
    A framework for benchmarking clustering algorithms. (arXiv:2209.09493v3 [cs.LG] UPDATED)
    The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at .
    COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers. (arXiv:2309.01270v2 [cs.CV] UPDATED)
    We present COMEDIAN, a novel pipeline to initialize spatiotemporal transformers for action spotting, which involves self-supervised learning and knowledge distillation. Action spotting is a timestamp-level temporal action detection task. Our pipeline consists of three steps, with two initialization stages. First, we perform self-supervised initialization of a spatial transformer using short videos as input. Additionally, we initialize a temporal transformer that enhances the spatial transformer's outputs with global context through knowledge distillation from a pre-computed feature bank aligned with each short video segment. In the final step, we fine-tune the transformers to the action spotting task. The experiments, conducted on the SoccerNet-v2 dataset, demonstrate state-of-the-art performance and validate the effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
    Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models. (arXiv:2310.17530v1 [cs.CV])
    Pretrained machine learning models are known to perpetuate and even amplify existing biases in data, which can result in unfair outcomes that ultimately impact user experience. Therefore, it is crucial to understand the mechanisms behind those prejudicial biases to ensure that model performance does not result in discriminatory behaviour toward certain groups or populations. In this work, we define gender bias as our case study. We quantify bias amplification in pretraining and after fine-tuning on three families of vision-and-language models. We investigate the connection, if any, between the two learning stages, and evaluate how bias amplification reflects on model performance. Overall, we find that bias amplification in pretraining and after fine-tuning are independent. We then examine the effect of continued pretraining on gender-neutral data, finding that this reduces group disparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without significantly compromising task performance.
    Bifurcations and loss jumps in RNN training. (arXiv:2310.17561v1 [cs.LG])
    Recurrent neural networks (RNNs) are popular machine learning tools for modeling and forecasting sequential data and for inferring dynamical systems (DS) from observed time series. Concepts from DS theory (DST) have variously been used to further our understanding of both, how trained RNNs solve complex tasks, and the training process itself. Bifurcations are particularly important phenomena in DS, including RNNs, that refer to topological (qualitative) changes in a system's dynamical behavior as one or more of its parameters are varied. Knowing the bifurcation structure of an RNN will thus allow to deduce many of its computational and dynamical properties, like its sensitivity to parameter variations or its behavior during training. In particular, bifurcations may account for sudden loss jumps observed in RNN training that could severely impede the training process. Here we first mathematically prove for a particular class of ReLU-based RNNs that certain bifurcations are indeed associated with loss gradients tending toward infinity or zero. We then introduce a novel heuristic algorithm for detecting all fixed points and k-cycles in ReLU-based RNNs and their existence and stability regions, hence bifurcation manifolds in parameter space. In contrast to previous numerical algorithms for finding fixed points and common continuation methods, our algorithm provides exact results and returns fixed points and cycles up to high orders with surprisingly good scaling behavior. We exemplify the algorithm on the analysis of the training process of RNNs, and find that the recently introduced technique of generalized teacher forcing completely avoids certain types of bifurcations in training. Thus, besides facilitating the DST analysis of trained RNNs, our algorithm provides a powerful instrument for analyzing the training process itself.
    Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction. (arXiv:2301.08951v4 [cs.CV] UPDATED)
    When perceiving the world from multiple viewpoints, humans have the ability to reason about the complete objects in a compositional manner even when an object is completely occluded from certain viewpoints. Meanwhile, humans are able to imagine novel views after observing multiple viewpoints. Recent remarkable advances in multi-view object-centric learning still leaves some unresolved problems: 1) The shapes of partially or completely occluded objects can not be well reconstructed. 2) The novel viewpoint prediction depends on expensive viewpoint annotations rather than implicit rules in view representations. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of an object accurately, we enhance the disentanglement between the latent representations of objects and views, where the latent representations of time-conditioned views are jointly inferred with a Transformer and then are input to a sequential extension of Slot Attention to learn object-centric representations. In addition, Gaussian processes are employed as priors of view latent variables for video generation and novel-view prediction without viewpoint annotations. Experiments on multiple datasets demonstrate that the proposed model can make object-centric video decomposition, reconstruct the complete shapes of occluded objects, and make novel-view predictions.
    Bounding Box-based Multi-objective Bayesian Optimization of Risk Measures under Input Uncertainty. (arXiv:2301.11588v2 [stat.ML] UPDATED)
    In this study, we propose a novel multi-objective Bayesian optimization (MOBO) method to efficiently identify the Pareto front (PF) defined by risk measures for black-box functions under the presence of input uncertainty (IU). Existing BO methods for Pareto optimization in the presence of IU are risk-specific or without theoretical guarantees, whereas our proposed method addresses general risk measures and has theoretical guarantees. The basic idea of the proposed method is to assume a Gaussian process (GP) model for the black-box function and to construct high-probability bounding boxes for the risk measures using the GP model. Furthermore, in order to reduce the uncertainty of non-dominated bounding boxes, we propose a method of selecting the next evaluation point using a maximin distance defined by the maximum value of a quasi distance based on bounding boxes. As theoretical analysis, we prove that the algorithm can return an arbitrary-accurate solution in a finite number of iterations with high probability, for various risk measures such as Bayes risk, worst-case risk, and value-at-risk. We also give a theoretical analysis that takes into account approximation errors because there exist non-negligible approximation errors (e.g., finite approximation of PFs and sampling-based approximation of bounding boxes) in practice. We confirm that the proposed method outperforms compared with existing methods not only in the setting with IU but also in the setting of ordinary MOBO through numerical experiments.
    Trust, but Verify: Robust Image Segmentation using Deep Learning. (arXiv:2310.16999v1 [cs.CV])
    We describe a method for verifying the output of a deep neural network for medical image segmentation that is robust to several classes of random as well as worst-case perturbations i.e. adversarial attacks. This method is based on a general approach recently developed by the authors called ``Trust, but Verify" wherein an auxiliary verification network produces predictions about certain masked features in the input image using the segmentation as an input. A well-designed auxiliary network will produce high-quality predictions when the input segmentations are accurate, but will produce low-quality predictions when the segmentations are incorrect. Checking the predictions of such a network with the original image allows us to detect bad segmentations. However, to ensure the verification method is truly robust, we need a method for checking the quality of the predictions that does not itself rely on a black-box neural network. Indeed, we show that previous methods for segmentation evaluation that do use deep neural regression networks are vulnerable to false negatives i.e. can inaccurately label bad segmentations as good. We describe the design of a verification network that avoids such vulnerability and present results to demonstrate its robustness compared to previous methods.
    The statistical thermodynamics of generative diffusion models. (arXiv:2310.17467v1 [stat.ML])
    Generative diffusion models have achieved spectacular performance in many areas of generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We argue that this lead to a form of instability that lies at the heart of their generative capabilities and that can be described by a set of mean field critical exponents. We conclude by analyzing recent work connecting diffusion models and associative memory networks in view of the thermodynamic formulations.
    Bias in Evaluation Processes: An Optimization-Based Model. (arXiv:2310.17489v1 [cs.CY])
    Biases with respect to socially-salient attributes of individuals have been well documented in evaluation processes used in settings such as admissions and hiring. We view such an evaluation process as a transformation of a distribution of the true utility of an individual for a task to an observed distribution and model it as a solution to a loss minimization problem subject to an information constraint. Our model has two parameters that have been identified as factors leading to biases: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function. We characterize the distributions that arise from our model and study the effect of the parameters on the observed distribution. The outputs of our model enrich the class of distributions that can be used to capture variation across groups in the observed evaluations. We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. These results contribute to an understanding of the emergence of bias in evaluation processes and provide tools to guide the deployment of interventions to mitigate biases.
    Transferring a molecular foundation model for polymer property predictions. (arXiv:2310.16958v1 [cs.LG])
    Transformer-based large language models have remarkable potential to accelerate design optimization for applications such as drug development and materials discovery. Self-supervised pretraining of transformer models requires large-scale datasets, which are often sparsely populated in topical areas such as polymer science. State-of-the-art approaches for polymers conduct data augmentation to generate additional samples but unavoidably incurs extra computational costs. In contrast, large-scale open-source datasets are available for small molecules and provide a potential solution to data scarcity through transfer learning. In this work, we show that using transformers pretrained on small molecules and fine-tuned on polymer properties achieve comparable accuracy to those trained on augmented polymer datasets for a series of benchmark prediction tasks.
    BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs' Generation. (arXiv:2310.17054v1 [cs.CL])
    Large language models (LLMs) such as GPT-3 have demonstrated a strong capability to generate coherent and contextually relevant text. However, amidst their successes, a crucial issue persists: their generated outputs still lack commonsense at times. Moreover, fine-tuning the entire LLM towards more commonsensical outputs is computationally expensive if not infeasible. In this paper, we present a computation-efficient framework that steers a frozen Pre-Trained Language Model (PTLM) towards more commonsensical generation (i.e., producing a plausible output that incorporates a list of concepts in a meaningful way). Specifically, we first construct a reference-free evaluator that assigns a sentence with a commonsensical score by grounding the sentence to a dynamic commonsense knowledge base from four different relational aspects. We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head that guides a fixed PTLM to better satisfy the oracle. We test our framework on a series of GPT-2-, Flan-T5-, and Alpaca-based language models (LMs) on two constrained concept-to-sentence benchmarks. Human evaluation results demonstrate that our method consistently leads to the most commonsensical outputs.
    On the Convergence of CART under Sufficient Impurity Decrease Condition. (arXiv:2310.17114v1 [stat.ML])
    The decision tree is a flexible machine learning model that finds its success in numerous applications. It is usually fitted in a recursively greedy manner using CART. In this paper, we investigate the convergence rate of CART under a regression setting. First, we establish an upper bound on the prediction error of CART under a sufficient impurity decrease (SID) condition \cite{chi2022asymptotic} -- our result improves upon the known result by \cite{chi2022asymptotic} under a similar assumption. Furthermore, we provide examples that demonstrate the error bound cannot be further improved by more than a constant or a logarithmic factor. Second, we introduce a set of easily verifiable sufficient conditions for the SID condition. Specifically, we demonstrate that the SID condition can be satisfied in the case of an additive model, provided that the component functions adhere to a ``locally reverse Poincar{\'e} inequality". We discuss several well-known function classes in non-parametric estimation to illustrate the practical utility of this concept.
    Harnessing the Power of Choices in Decision Tree Learning. (arXiv:2310.01551v2 [cs.LG] UPDATED)
    We propose a simple generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the best attribute. Our algorithm, Top-$k$, considers the $k$ best attributes as possible splits instead of just the single best attribute. We demonstrate, theoretically and empirically, the power of this simple generalization. We first prove a {\sl greediness hierarchy theorem} showing that for every $k \in \mathbb{N}$, Top-$(k+1)$ can be dramatically more powerful than Top-$k$: there are data distributions for which the former achieves accuracy $1-\varepsilon$, whereas the latter only achieves accuracy $\frac1{2}+\varepsilon$. We then show, through extensive experiments, that Top-$k$ outperforms the two main approaches to decision tree learning: classic greedy algorithms and more recent "optimal decision tree" algorithms. On one hand, Top-$k$ consistently enjoys significant accuracy gains over greedy algorithms across a wide range of benchmarks. On the other hand, Top-$k$ is markedly more scalable than optimal decision tree algorithms and is able to handle dataset and feature set sizes that remain far beyond the reach of these algorithms.
    Privately Aligning Language Models with Reinforcement Learning. (arXiv:2310.16960v1 [cs.LG])
    Positioned between pre-training and user deployment, aligning large language models (LLMs) through reinforcement learning (RL) has emerged as a prevailing strategy for training instruction following-models such as ChatGPT. In this work, we initiate the study of privacy-preserving alignment of LLMs through Differential Privacy (DP) in conjunction with RL. Following the influential work of Ziegler et al. (2020), we study two dominant paradigms: (i) alignment via RL without human in the loop (e.g., positive review generation) and (ii) alignment via RL from human feedback (RLHF) (e.g., summarization in a human-preferred way). We give a new DP framework to achieve alignment via RL, and prove its correctness. Our experimental results validate the effectiveness of our approach, offering competitive utility while ensuring strong privacy protections.
    Learning to Rank for Active Learning via Multi-Task Bilevel Optimization. (arXiv:2310.17044v1 [cs.LG])
    Active learning is a promising paradigm to reduce the labeling cost by strategically requesting labels to improve model performance. However, existing active learning methods often rely on expensive acquisition function to compute, extensive modeling retraining and multiple rounds of interaction with annotators. To address these limitations, we propose a novel approach for active learning, which aims to select batches of unlabeled instances through a learned surrogate model for data acquisition. A key challenge in this approach is developing an acquisition function that generalizes well, as the history of data, which forms part of the utility function's input, grows over time. Our novel algorithmic contribution is a bilevel multi-task bilevel optimization framework that predicts the relative utility -- measured by the validation accuracy -- of different training sets, and ensures the learned acquisition function generalizes effectively. For cases where validation accuracy is expensive to evaluate, we introduce efficient interpolation-based surrogate models to estimate the utility function, reducing the evaluation cost. We demonstrate the performance of our approach through extensive experiments on standard active classification benchmarks. By employing our learned utility function, we show significant improvements over traditional techniques, paving the way for more efficient and effective utility maximization in active learning applications.
    Strategizing EV Charging and Renewable Integration in Texas. (arXiv:2310.17056v1 [eess.SY])
    Exploring the convergence of electric vehicles (EVs), renewable energy, and smart grid technologies in the context of Texas, this study addresses challenges hindering the widespread adoption of EVs. Acknowledging their environmental benefits, the research focuses on grid stability concerns, uncoordinated charging patterns, and the complicated relationship between EVs and renewable energy sources. Dynamic time warping (DTW) clustering and k-means clustering methodologies categorize days based on total load and net load, offering nuanced insights into daily electricity consumption and renewable energy generation patterns. By establishing optimal charging and vehicle-to-grid (V2G) windows tailored to specific load characteristics, the study provides a sophisticated methodology for strategic decision-making in energy consumption and renewable integration. The findings contribute to the ongoing discourse on achieving a sustainable and resilient energy future through the seamless integration of EVs into smart grids.
    Isometric Motion Manifold Primitives. (arXiv:2310.17072v1 [cs.AI])
    The Motion Manifold Primitive (MMP) produces, for a given task, a continuous manifold of trajectories each of which can successfully complete the task. It consists of the decoder function that parametrizes the manifold and the probability density in the latent coordinate space. In this paper, we first show that the MMP performance can significantly degrade due to the geometric distortion in the latent space -- by distortion, we mean that similar motions are not located nearby in the latent space. We then propose {\it Isometric Motion Manifold Primitives (IMMP)} whose latent coordinate space preserves the geometry of the manifold. For this purpose, we formulate and use a Riemannian metric for the motion space (i.e., parametric curve space), which we call a {\it CurveGeom Riemannian metric}. Experiments with planar obstacle-avoiding motions and pushing manipulation tasks show that IMMP significantly outperforms existing MMP methods. Code is available at https://github.com/Gabe-YHLee/IMMP-public.
    Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models. (arXiv:2310.17086v1 [cs.LG])
    Transformers are remarkably good at in-context learning (ICL) -- learning from demonstrations without parameter updates -- but how they perform ICL remains a mystery. Recent work suggests that Transformers may learn in-context by internally running Gradient Descent, a first-order optimization method. In this paper, we instead demonstrate that Transformers learn to implement higher-order optimization methods to perform ICL. Focusing on in-context linear regression, we show that Transformers learn to implement an algorithm very similar to Iterative Newton's Method, a higher-order optimization method, rather than Gradient Descent. Empirically, we show that predictions from successive Transformer layers closely match different iterations of Newton's Method linearly, with each middle layer roughly computing 3 iterations. In contrast, exponentially more Gradient Descent steps are needed to match an additional Transformers layer; this suggests that Transformers have an comparable rate of convergence with high-order methods such as Iterative Newton, which are exponentially faster than Gradient Descent. We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, we show theoretical results which support our empirical findings and have a close correspondence with them: we prove that Transformers can implement $k$ iterations of Newton's method with $\mathcal{O}(k)$ layers.
    Efficient Neural Network Approaches for Conditional Optimal Transport with Applications in Bayesian Inference. (arXiv:2310.16975v1 [stat.ML])
    We present two neural network approaches that approximate the solutions of static and dynamic conditional optimal transport (COT) problems, respectively. Both approaches enable sampling and density estimation of conditional probability distributions, which are core tasks in Bayesian inference. Our methods represent the target conditional distributions as transformations of a tractable reference distribution and, therefore, fall into the framework of measure transport. COT maps are a canonical choice within this framework, with desirable properties such as uniqueness and monotonicity. However, the associated COT problems are computationally challenging, even in moderate dimensions. To improve the scalability, our numerical algorithms leverage neural networks to parameterize COT maps. Our methods exploit the structure of the static and dynamic formulations of the COT problem. PCP-Map models conditional transport maps as the gradient of a partially input convex neural network (PICNN) and uses a novel numerical implementation to increase computational efficiency compared to state-of-the-art alternatives. COT-Flow models conditional transports via the flow of a regularized neural ODE; it is slower to train but offers faster sampling. We demonstrate their effectiveness and efficiency by comparing them with state-of-the-art approaches using benchmark datasets and Bayesian inverse problems.
    Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark. (arXiv:2310.16981v1 [cs.LG])
    Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper addresses this issue by exploring the potential of integrating data-centric AI techniques which profile the data to guide the synthetic data generation process. Moreover, we shed light on the often ignored consequences of neglecting these data profiles during synthetic data generation -- despite seemingly high statistical fidelity. Subsequently, we propose a novel framework to evaluate the integration of data profiles to guide the creation of more representative synthetic data. In an empirical study, we evaluate the performance of five state-of-the-art models for tabular data generation on eleven distinct tabular datasets. The findings offer critical insights into the successes and limitations of current synthetic data generation techniques. Finally, we provide practical recommendations for integrating data-centric insights into the synthetic data generation process, with a specific focus on classification performance, model selection, and feature selection. This study aims to reevaluate conventional approaches to synthetic data generation and promote the application of data-centric AI techniques in improving the quality and effectiveness of synthetic data.
    Conditionally Combining Robot Skills using Large Language Models. (arXiv:2310.17019v1 [cs.LG])
    This paper combines two contributions. First, we introduce an extension of the Meta-World benchmark, which we call "Language-World," which allows a large language model to operate in a simulated robotic environment using semi-structured natural language queries and scripted skills described using natural language. By using the same set of tasks as Meta-World, Language-World results can be easily compared to Meta-World results, allowing for a point of comparison between recent methods using Large Language Models (LLMs) and those using Deep Reinforcement Learning. Second, we introduce a method we call Plan Conditioned Behavioral Cloning (PCBC), that allows finetuning the behavior of high-level plans using end-to-end demonstrations. Using Language-World, we show that PCBC is able to achieve strong performance in a variety of few-shot regimes, often achieving task generalization with as little as a single demonstration. We have made Language-World available as open-source software at https://github.com/krzentner/language-world/.
    Streaming Factor Trajectory Learning for Temporal Tensor Decomposition. (arXiv:2310.17021v1 [cs.LG])
    Practical tensor data is often along with time information. Most existing temporal decomposition approaches estimate a set of fixed factors for the objects in each tensor mode, and hence cannot capture the temporal evolution of the objects' representation. More important, we lack an effective approach to capture such evolution from streaming data, which is common in real-world applications. To address these issues, we propose Streaming Factor Trajectory Learning (SFTL) for temporal tensor decomposition. We use Gaussian processes (GPs) to model the trajectory of factors so as to flexibly estimate their temporal evolution. To address the computational challenges in handling streaming data, we convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE). We develop an efficient online filtering algorithm to estimate a decoupled running posterior of the involved factor states upon receiving new data. The decoupled estimation enables us to conduct standard Rauch-Tung-Striebel smoothing to compute the full posterior of all the trajectories in parallel, without the need for revisiting any previous data. We have shown the advantage of SFTL in both synthetic tasks and real-world applications.
    Road Network Guided Fine-Grained Urban Traffic Flow Inference. (arXiv:2109.14251v3 [cs.LG] UPDATED)
    Accurate inference of fine-grained traffic flow from coarse-grained one is an emerging yet crucial problem, which can help greatly reduce the number of the required traffic monitoring sensors for cost savings. In this work, we notice that traffic flow has a high correlation with road network, which was either completely ignored or simply treated as an external factor in previous works. To facilitate this problem, we propose a novel Road-Aware Traffic Flow Magnifier (RATFM) that explicitly exploits the prior knowledge of road networks to fully learn the road-aware spatial distribution of fine-grained traffic flow. Specifically, a multi-directional 1D convolutional layer is first introduced to extract the semantic feature of the road network. Subsequently, we incorporate the road network feature and coarse-grained flow feature to regularize the short-range spatial distribution modeling of road-relative traffic flow. Furthermore, we take the road network feature as a query to capture the long-range spatial distribution of traffic flow with a transformer architecture. Benefiting from the road-aware inference mechanism, our method can generate high-quality fine-grained traffic flow maps. Extensive experiments on three real-world datasets show that the proposed RATFM outperforms state-of-the-art models under various scenarios. Our code and datasets are released at {\url{https://github.com/luimoli/RATFM}}.
    Label Embedding via Low-Coherence Matrices. (arXiv:2305.19470v3 [cs.LG] UPDATED)
    Label embedding is a framework for multiclass classification problems where each label is represented by a distinct vector of some fixed dimension, and training involves matching model output to the vector representing the correct label. While label embedding has been successfully applied in extreme classification and zero-shot learning, and offers both computational and statistical advantages, its theoretical foundations remain poorly understood. This work presents an analysis of label embedding in the context of extreme multiclass classification, where the number of classes $C$ is very large. We present an excess risk bound that reveals a trade-off between computational and statistical efficiency, quantified via the coherence of the embedding matrix. We further show that under the Massart noise condition, the statistical penalty for label embedding vanishes with sufficiently low coherence. Our analysis supports an algorithm that is simple, scalable, and easily parallelizable, and experimental results demonstrate its effectiveness in large-scale applications.
    Controlled Decoding from Language Models. (arXiv:2310.17022v1 [cs.LG])
    We propose controlled decoding (CD), a novel off-policy reinforcement learning method to control the autoregressive generation from language models towards high reward outcomes. CD solves an off-policy reinforcement learning problem through a value function for the reward, which we call a prefix scorer. The prefix scorer is used at inference time to steer the generation towards higher reward outcomes. We show that the prefix scorer may be trained on (possibly) off-policy data to predict the expected reward when decoding is continued from a partially decoded response. We empirically demonstrate that CD is effective as a control mechanism on Reddit conversations corpus. We also show that the modularity of the design of CD makes it possible to control for multiple rewards, effectively solving a multi-objective reinforcement learning problem with no additional complexity. Finally, we show that CD can be applied in a novel blockwise fashion at inference-time, again without the need for any training-time changes, essentially bridging the gap between the popular best-of-$K$ strategy and token-level reinforcement learning. This makes CD a promising approach for alignment of language models.
    Towards Matching Phones and Speech Representations. (arXiv:2310.17558v1 [cs.CL])
    Learning phone types from phone instances has been a long-standing problem, while still being open. In this work, we revisit this problem in the context of self-supervised learning, and pose it as the problem of matching cluster centroids to phone embeddings. We study two key properties that enable matching, namely, whether cluster centroids of self-supervised representations reduce the variability of phone instances and respect the relationship among phones. We then use the matching result to produce pseudo-labels and introduce a new loss function for improving self-supervised representations. Our experiments show that the matching result captures the relationship among phones. Training the new loss function jointly with the regular self-supervised losses, such as APC and CPC, significantly improves the downstream phone classification.
    Lithium Metal Battery Quality Control via Transformer-CNN Segmentation. (arXiv:2302.04824v2 [cs.CV] UPDATED)
    Lithium metal battery (LMB) has the potential to be the next-generation battery system because of its high theoretical energy density. However, defects known as dendrites are formed by heterogeneous lithium (Li) plating, which hinders the development and utilization of LMBs. Non-destructive techniques to observe the dendrite morphology often use X-ray computed tomography (XCT) to provide cross-sectional views. To retrieve three-dimensional structures inside a battery, image segmentation becomes essential to quantitatively analyze XCT images. This work proposes a new semantic segmentation approach using a transformer-based neural network called TransforCNN that is capable of segmenting out dendrites from XCT data. In addition, we compare the performance of the proposed TransforCNN with three other algorithms, such as U-Net, Y-Net, and E-Net, consisting of an Ensemble Network model for XCT analysis. Our results show the advantages of using TransforCNN when evaluating over-segmentation metrics, such as mean Intersection over Union (mIoU) and mean Dice Similarity Coefficient (mDSC) as well as through several qualitatively comparative visualizations.
    Statistically Valid Variable Importance Assessment through Conditional Permutations. (arXiv:2309.07593v2 [cs.LG] UPDATED)
    Variable importance assessment has become a crucial step in machine-learning applications when using complex learners, such as deep neural networks, on large-scale data. Removal-based importance assessment is currently the reference approach, particularly when statistical guarantees are sought to justify variable inclusion. It is often implemented with variable permutation schemes. On the flip side, these approaches risk misidentifying unimportant variables as important in the presence of correlations among covariates. Here we develop a systematic approach for studying Conditional Permutation Importance (CPI) that is model agnostic and computationally lean, as well as reusable benchmarks of state-of-the-art variable importance estimators. We show theoretically and empirically that $\textit{CPI}$ overcomes the limitations of standard permutation importance by providing accurate type-I error control. When used with a deep neural network, $\textit{CPI}$ consistently showed top accuracy across benchmarks. An experiment on real-world data analysis in a large-scale medical dataset showed that $\textit{CPI}$ provides a more parsimonious selection of statistically significant variables. Our results suggest that $\textit{CPI}$ can be readily used as drop-in replacement for permutation-based methods.
    Looping in the Human: Collaborative and Explainable Bayesian Optimization. (arXiv:2310.17273v1 [cs.LG])
    Like many optimizers, Bayesian optimization often falls short of gaining user trust due to opacity. While attempts have been made to develop human-centric optimizers, they typically assume user knowledge is well-specified and error-free, employing users mainly as supervisors of the optimization process. We relax these assumptions and propose a more balanced human-AI partnership with our Collaborative and Explainable Bayesian Optimization (CoExBO) framework. Instead of explicitly requiring a user to provide a knowledge model, CoExBO employs preference learning to seamlessly integrate human insights into the optimization, resulting in algorithmic suggestions that resonate with user preference. CoExBO explains its candidate selection every iteration to foster trust, empowering users with a clearer grasp of the optimization. Furthermore, CoExBO offers a no-harm guarantee, allowing users to make mistakes; even with extreme adversarial interventions, the algorithm converges asymptotically to a vanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI teaming experiments in lithium-ion battery design, highlighting substantial improvements over conventional methods.
    Neuro-Inspired Fragmentation and Recall to Overcome Catastrophic Forgetting in Curiosity. (arXiv:2310.17537v1 [cs.AI])
    Deep reinforcement learning methods exhibit impressive performance on a range of tasks but still struggle on hard exploration tasks in large environments with sparse rewards. To address this, intrinsic rewards can be generated using forward model prediction errors that decrease as the environment becomes known, and incentivize an agent to explore novel states. While prediction-based intrinsic rewards can help agents solve hard exploration tasks, they can suffer from catastrophic forgetting and actually increase at visited states. We first examine the conditions and causes of catastrophic forgetting in grid world environments. We then propose a new method FARCuriosity, inspired by how humans and animals learn. The method depends on fragmentation and recall: an agent fragments an environment based on surprisal, and uses different local curiosity modules (prediction-based intrinsic reward functions) for each fragment so that modules are not trained on the entire environment. At each fragmentation event, the agent stores the current module in long-term memory (LTM) and either initializes a new module or recalls a previously stored module based on its match with the current state. With fragmentation and recall, FARCuriosity achieves less forgetting and better overall performance in games with varied and heterogeneous environments in the Atari benchmark suite of tasks. Thus, this work highlights the problem of catastrophic forgetting in prediction-based curiosity methods and proposes a solution.
    Efficient Numerical Algorithm for Large-Scale Damped Natural Gradient Descent. (arXiv:2310.17556v1 [cs.LG])
    We propose a new algorithm for efficiently solving the damped Fisher matrix in large-scale scenarios where the number of parameters significantly exceeds the number of available samples. This problem is fundamental for natural gradient descent and stochastic reconfiguration. Our algorithm is based on Cholesky decomposition and is generally applicable. Benchmark results show that the algorithm is significantly faster than existing methods.
    Fine-tuning Global Model via Data-Free Knowledge Distillation for Non-IID Federated Learning. (arXiv:2203.09249v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an emerging distributed learning paradigm under privacy constraint. Data heterogeneity is one of the main challenges in FL, which results in slow convergence and degraded performance. Most existing approaches only tackle the heterogeneity challenge by restricting the local model update in client, ignoring the performance drop caused by direct global model aggregation. Instead, we propose a data-free knowledge distillation method to fine-tune the global model in the server (FedFTG), which relieves the issue of direct model aggregation. Concretely, FedFTG explores the input space of local models through a generator, and uses it to transfer the knowledge from local models to the global model. Besides, we propose a hard sample mining scheme to achieve effective knowledge distillation throughout the training. In addition, we develop customized label sampling and class-level ensemble to derive maximum utilization of knowledge, which implicitly mitigates the distribution discrepancy across clients. Extensive experiments show that our FedFTG significantly outperforms the state-of-the-art (SOTA) FL algorithms and can serve as a strong plugin for enhancing FedAvg, FedProx, FedDyn, and SCAFFOLD.
    Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals. (arXiv:2302.04449v3 [cs.LG] UPDATED)
    High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent, when interaction is detected. Experimentally, various RL algorithms obtain significant improvement in performance and training speed when assisted by our design.
    SoK: Pitfalls in Evaluating Black-Box Attacks. (arXiv:2310.17534v1 [cs.CR])
    Numerous works study black-box attacks on image classifiers. However, these works make different assumptions on the adversary's knowledge and current literature lacks a cohesive organization centered around the threat model. To systematize knowledge in this area, we propose a taxonomy over the threat space spanning the axes of feedback granularity, the access of interactive queries, and the quality and quantity of the auxiliary data available to the attacker. Our new taxonomy provides three key insights. 1) Despite extensive literature, numerous under-explored threat spaces exist, which cannot be trivially solved by adapting techniques from well-explored settings. We demonstrate this by establishing a new state-of-the-art in the less-studied setting of access to top-k confidence scores by adapting techniques from well-explored settings of accessing the complete confidence vector, but show how it still falls short of the more restrictive setting that only obtains the prediction label, highlighting the need for more research. 2) Identification the threat model of different attacks uncovers stronger baselines that challenge prior state-of-the-art claims. We demonstrate this by enhancing an initially weaker baseline (under interactive query access) via surrogate models, effectively overturning claims in the respective paper. 3) Our taxonomy reveals interactions between attacker knowledge that connect well to related areas, such as model inversion and extraction attacks. We discuss how advances in other areas can enable potentially stronger black-box attacks. Finally, we emphasize the need for a more realistic assessment of attack success by factoring in local attack runtime. This approach reveals the potential for certain attacks to achieve notably higher success rates and the need to evaluate attacks in diverse and harder settings, highlighting the need for better selection criteria.
    Little Exploration is All You Need. (arXiv:2310.17538v1 [cs.LG])
    The prevailing principle of "Optimism in the Face of Uncertainty" advocates for the incorporation of an exploration bonus, generally assumed to be proportional to the inverse square root of the visit count ($1/\sqrt{n}$), where $n$ is the number of visits to a particular state-action pair. This approach, however, exclusively focuses on "uncertainty," neglecting the inherent "difficulty" of different options. To address this gap, we introduce a novel modification of standard UCB algorithm in the multi-armed bandit problem, proposing an adjusted bonus term of $1/n^\tau$, where $\tau > 1/2$, that accounts for task difficulty. Our proposed algorithm, denoted as UCB$^\tau$, is substantiated through comprehensive regret and risk analyses, confirming its theoretical robustness. Comparative evaluations with standard UCB and Thompson Sampling algorithms on synthetic datasets demonstrate that UCB$^\tau$ not only outperforms in efficacy but also exhibits lower risk across various environmental conditions and hyperparameter settings.
    Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions. (arXiv:2310.17502v1 [cs.SD])
    Customizing voice and speaking style in a speech synthesis system with intuitive and fine-grained controls is challenging, given that little data with appropriate labels is available. Furthermore, editing an existing human's voice also comes with ethical concerns. In this paper, we propose a method to generate artificial speaker embeddings that cannot be linked to a real human while offering intuitive and fine-grained control over the voice and speaking style of the embeddings, without requiring any labels for speaker or style. The artificial and controllable embeddings can be fed to a speech synthesis system, conditioned on embeddings of real humans during training, without sacrificing privacy during inference.
    Cross-feature Contrastive Loss for Decentralized Deep Learning on Heterogeneous Data. (arXiv:2310.15890v2 [cs.LG] UPDATED)
    The current state-of-the-art decentralized learning algorithms mostly assume the data distribution to be Independent and Identically Distributed (IID). However, in practical scenarios, the distributed datasets can have significantly heterogeneous data distributions across the agents. In this work, we present a novel approach for decentralized learning on heterogeneous data, where data-free knowledge distillation through contrastive loss on cross-features is utilized to improve performance. Cross-features for a pair of neighboring agents are the features (i.e., last hidden layer activations) obtained from the data of an agent with respect to the model parameters of the other agent. We demonstrate the effectiveness of the proposed technique through an exhaustive set of experiments on various Computer Vision datasets (CIFAR-10, CIFAR-100, Fashion MNIST, Imagenette, and ImageNet), model architectures, and network topologies. Our experiments show that the proposed method achieves superior performance (0.2-4% improvement in test accuracy) compared to other existing techniques for decentralized learning on heterogeneous data.
    Improving Few-Shot Learning through Multi-task Representation Learning Theory. (arXiv:2010.01992v3 [cs.LG] CROSS LISTED)
    In this paper, we consider the framework of multi-task representation (MTR) learning where the goal is to use source tasks to learn a representation that reduces the sample complexity of solving a target task. We start by reviewing recent advances in MTR theory and show that they can provide novel insights for popular meta-learning algorithms when analyzed within this framework. In particular, we highlight a fundamental difference between gradient-based and metric-based algorithms in practice and put forward a theoretical analysis to explain it. Finally, we use the derived insights to improve the performance of meta-learning methods via a new spectral-based regularization term and confirm its efficiency through experimental studies on few-shot classification benchmarks. To the best of our knowledge, this is the first contribution that puts the most recent learning bounds of MTR theory into practice for the task of few-shot classification.
    StochGradAdam: Accelerating Neural Networks Training with Stochastic Gradient Sampling. (arXiv:2310.17042v1 [cs.LG])
    In the rapidly advancing domain of deep learning optimization, this paper unveils the StochGradAdam optimizer, a novel adaptation of the well-regarded Adam algorithm. Central to StochGradAdam is its gradient sampling technique. This method not only ensures stable convergence but also leverages the advantages of selective gradient consideration, fostering robust training by potentially mitigating the effects of noisy or outlier data and enhancing the exploration of the loss landscape for more dependable convergence. In both image classification and segmentation tasks, StochGradAdam has demonstrated superior performance compared to the traditional Adam optimizer. By judiciously sampling a subset of gradients at each iteration, the optimizer is optimized for managing intricate models. The paper provides a comprehensive exploration of StochGradAdam's methodology, from its mathematical foundations to bias correction strategies, heralding a promising advancement in deep learning training techniques.
    FedPEAT: Convergence of Federated Learning, Parameter-Efficient Fine Tuning, and Emulator Assisted Tuning for Artificial Intelligence Foundation Models with Mobile Edge Computing. (arXiv:2310.17491v1 [cs.LG])
    The emergence of foundation models, including language and vision models, has reshaped AI's landscape, offering capabilities across various applications. Deploying and fine-tuning these large models, like GPT-3 and BERT, presents challenges, especially in the current foundation model era. We introduce Emulator-Assisted Tuning (EAT) combined with Parameter-Efficient Fine-Tuning (PEFT) to form Parameter-Efficient Emulator-Assisted Tuning (PEAT). Further, we expand this into federated learning as Federated PEAT (FedPEAT). FedPEAT uses adapters, emulators, and PEFT for federated model tuning, enhancing model privacy and memory efficiency. Adapters adjust pre-trained models, while emulators give a compact representation of original models, addressing both privacy and efficiency. Adaptable to various neural networks, our approach also uses deep reinforcement learning for hyper-parameter optimization. We tested FedPEAT in a unique scenario with a server participating in collaborative federated tuning, showcasing its potential in tackling foundation model challenges.
    IDENAS: Internal Dependency Exploration for Neural Architecture Search. (arXiv:2310.17250v1 [cs.LG])
    Machine learning is a powerful tool for extracting valuable information and making various predictions from diverse datasets. Traditional algorithms rely on well-defined input and output variables however, there are scenarios where the distinction between the input and output variables and the underlying, associated (input and output) layers of the model, are unknown. Neural Architecture Search (NAS) and Feature Selection have emerged as promising solutions in such scenarios. This research proposes IDENAS, an Internal Dependency-based Exploration for Neural Architecture Search, integrating NAS with feature selection. The methodology explores internal dependencies in the complete parameter space for classification involving 1D sensor and 2D image data as well. IDENAS employs a modified encoder-decoder model and the Sequential Forward Search (SFS) algorithm, combining input-output configuration search with embedded feature selection. Experimental results demonstrate IDENASs superior performance in comparison to other algorithms, showcasing its effectiveness in model development pipelines and automated machine learning. On average, IDENAS achieved significant modelling improvements, underscoring its significant contribution to advancing the state-of-the-art in neural architecture search and feature selection integration.
    Diagnosing Alzheimer's Disease using Early-Late Multimodal Data Fusion with Jacobian Maps. (arXiv:2310.16936v1 [cs.CV])
    Alzheimer's disease (AD) is a prevalent and debilitating neurodegenerative disorder impacting a large aging population. Detecting AD in all its presymptomatic and symptomatic stages is crucial for early intervention and treatment. An active research direction is to explore machine learning methods that harness multimodal data fusion to outperform human inspection of medical scans. However, existing multimodal fusion models have limitations, including redundant computation, complex architecture, and simplistic handling of missing data. Moreover, the preprocessing pipelines of medical scans remain inadequately detailed and are seldom optimized for individual subjects. In this paper, we propose an efficient early-late fusion (ELF) approach, which leverages a convolutional neural network for automated feature extraction and random forests for their competitive performance on small datasets. Additionally, we introduce a robust preprocessing pipeline that adapts to the unique characteristics of individual subjects and makes use of whole brain images rather than slices or patches. Moreover, to tackle the challenge of detecting subtle changes in brain volume, we transform images into the Jacobian domain (JD) to enhance both accuracy and robustness in our classification. Using MRI and CT images from the OASIS-3 dataset, our experiments demonstrate the effectiveness of the ELF approach in classifying AD into four stages with an accuracy of 97.19%.
    Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM Cells for Embedded FPGAs. (arXiv:2310.16842v1 [cs.AR])
    To process sensor data in the Internet of Things(IoTs), embedded deep learning for 1-dimensional data is an important technique. In the past, CNNs were frequently used because they are simple to optimise for special embedded hardware such as FPGAs. This work proposes a novel LSTM cell optimisation aimed at energy-efficient inference on end devices. Using the traffic speed prediction as a case study, a vanilla LSTM model with the optimised LSTM cell achieves 17534 inferences per second while consuming only 3.8 $\mu$J per inference on the FPGA \textit{XC7S15} from \textit{Spartan-7} family. It achieves at least 5.4$\times$ faster throughput and 1.37$\times$ more energy efficient than existing approaches.
    Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers. (arXiv:2305.14858v2 [cs.LG] UPDATED)
    Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and stabilizing the training of Transformers. While LayerNorm recenters and rescales input vectors, RMSNorm only rescales the vectors by their RMS value. Despite being more computationally efficient, RMSNorm may compromise the representation ability of Transformers. There is currently no consensus regarding the preferred normalization technique, as some models employ LayerNorm while others utilize RMSNorm, especially in recent large language models. It is challenging to convert Transformers with one normalization to the other type. While there is an ongoing disagreement between the two normalization types, we propose a solution to unify two mainstream Transformer architectures, Pre-LN and Pre-RMSNorm Transformers. By removing the inherent redundant mean information in the main branch of Pre-LN Transformers, we can reduce LayerNorm to RMSNorm, achieving higher efficiency. We further propose the Compressed RMSNorm (CRMSNorm) and Pre-CRMSNorm Transformer based on a lossless compression of the zero-mean vectors. We formally establish the equivalence of Pre-LN, Pre-RMSNorm, and Pre-CRMSNorm Transformer variants in both training and inference. It implies that Pre-LN Transformers can be substituted with Pre-(C)RMSNorm counterparts at almost no cost, offering the same arithmetic functionality along with free efficiency improvement. Experiments demonstrate that we can reduce the training and inference time of Pre-LN Transformers by 1% - 10%.
    Can Differentiable Decision Trees Learn Interpretable Reward Functions?. (arXiv:2306.13004v3 [cs.LG] UPDATED)
    There is an increasing interest in learning reward functions that model human preferences. However, many frameworks use blackbox learning methods that, while expressive, are difficult to interpret. We propose and evaluate a novel approach for learning expressive and interpretable reward functions from preferences using Differentiable Decision Trees (DDTs). Our experiments across several domains, including CartPole, Visual Gridworld environments and Atari games, provide evidence that that the tree structure of our learned reward function is useful in determining the extent to which the reward function is aligned with human preferences. We provide experimental evidence that reward DDTs can achieve competitive performance when compared with larger capacity deep neural network reward functions. We also observe that the choice between soft and hard (argmax) output of reward DDT reveals a tension between wanting highly shaped rewards to ensure good RL performance, while also wanting simpler, more interpretable rewards.
    Sky Imager-Based Forecast of Solar Irradiance Using Machine Learning. (arXiv:2310.17356v1 [cs.CV])
    Ahead-of-time forecasting of the output power of power plants is essential for the stability of the electricity grid and ensuring uninterrupted service. However, forecasting renewable energy sources is difficult due to the chaotic behavior of natural energy sources. This paper presents a new approach to estimate short-term solar irradiance from sky images. The~proposed algorithm extracts features from sky images and use learning-based techniques to estimate the solar irradiance. The~performance of proposed machine learning (ML) algorithm is evaluated using two publicly available datasets of sky images. The~datasets contain over 350,000 images for an interval of 16 years, from 2004 to 2020, with the corresponding global horizontal irradiance (GHI) of each image as the ground truth. Compared to the state-of-the-art computationally heavy algorithms proposed in the literature, our approach achieves competitive results with much less computational complexity for both nowcasting and forecasting up to 4 h ahead of time.
    Unleashing the potential of GNNs via Bi-directional Knowledge Transfer. (arXiv:2310.17132v1 [cs.LG])
    Based on the message-passing paradigm, there has been an amount of research proposing diverse and impressive feature propagation mechanisms to improve the performance of GNNs. However, less focus has been put on feature transformation, another major operation of the message-passing framework. In this paper, we first empirically investigate the performance of the feature transformation operation in several typical GNNs. Unexpectedly, we notice that GNNs do not completely free up the power of the inherent feature transformation operation. By this observation, we propose the Bi-directional Knowledge Transfer (BiKT), a plug-and-play approach to unleash the potential of the feature transformation operations without modifying the original architecture. Taking the feature transformation operation as a derived representation learning model that shares parameters with the original GNN, the direct prediction by this model provides a topological-agnostic knowledge feedback that can further instruct the learning of GNN and the feature transformations therein. On this basis, BiKT not only allows us to acquire knowledge from both the GNN and its derived model but promotes each other by injecting the knowledge into the other. In addition, a theoretical analysis is further provided to demonstrate that BiKT improves the generalization bound of the GNNs from the perspective of domain adaption. An extensive group of experiments on up to 7 datasets with 5 typical GNNs demonstrates that BiKT brings up to 0.5% - 4% performance gain over the original GNN, which means a boosted GNN is obtained. Meanwhile, the derived model also shows a powerful performance to compete with or even surpass the original GNN, enabling us to flexibly apply it independently to some other specific downstream tasks.
    Effective Targeted Attacks for Adversarial Self-Supervised Learning. (arXiv:2210.10482v2 [cs.LG] UPDATED)
    Recently, unsupervised adversarial training (AT) has been highlighted as a means of achieving robustness in models without any label information. Previous studies in unsupervised AT have mostly focused on implementing self-supervised learning (SSL) frameworks, which maximize the instance-wise classification loss to generate adversarial examples. However, we observe that simply maximizing the self-supervised training loss with an untargeted adversarial attack often results in generating ineffective adversaries that may not help improve the robustness of the trained model, especially for non-contrastive SSL frameworks without negative examples. To tackle this problem, we propose a novel positive mining for targeted adversarial attack to generate effective adversaries for adversarial SSL frameworks. Specifically, we introduce an algorithm that selects the most confusing yet similar target example for a given instance based on entropy and similarity, and subsequently perturbs the given instance towards the selected target. Our method demonstrates significant enhancements in robustness when applied to non-contrastive SSL frameworks, and less but consistent robustness improvements with contrastive SSL frameworks, on the benchmark datasets.
    Learning Space-Time Continuous Neural PDEs from Partially Observed States. (arXiv:2307.04110v2 [cs.LG] UPDATED)
    We introduce a novel grid-independent model for learning partial differential equations (PDEs) from noisy and partial observations on irregular spatiotemporal grids. We propose a space-time continuous latent neural PDE model with an efficient probabilistic framework and a novel encoder design for improved data efficiency and grid independence. The latent state dynamics are governed by a PDE model that combines the collocation method and the method of lines. We employ amortized variational inference for approximate posterior estimation and utilize a multiple shooting technique for enhanced training speed and stability. Our model demonstrates state-of-the-art performance on complex synthetic and real-world datasets, overcoming limitations of previous approaches and effectively handling partially-observed data. The proposed model outperforms recent methods, showing its potential to advance data-driven PDE modeling and enabling robust, grid-independent modeling of complex partially-observed dynamic processes.
    Learning an Inventory Control Policy with General Inventory Arrival Dynamics. (arXiv:2310.17168v1 [cs.LG])
    In this paper we address the problem of learning and backtesting inventory control policies in the presence of general arrival dynamics -- which we term as a quantity-over-time arrivals model (QOT). We also allow for order quantities to be modified as a post-processing step to meet vendor constraints such as order minimum and batch size constraints -- a common practice in real supply chains. To the best of our knowledge this is the first work to handle either arbitrary arrival dynamics or an arbitrary downstream post-processing of order quantities. Building upon recent work (Madeka et al., 2022) we similarly formulate the periodic review inventory control problem as an exogenous decision process, where most of the state is outside the control of the agent. Madeka et al. (2022) show how to construct a simulator that replays historic data to solve this class of problem. In our case, we incorporate a deep generative model for the arrivals process as part of the history replay. By formulating the problem as an exogenous decision process, we can apply results from Madeka et al. (2022) to obtain a reduction to supervised learning. Finally, we show via simulation studies that this approach yields statistically significant improvements in profitability over production baselines. Using data from an ongoing real-world A/B test, we show that Gen-QOT generalizes well to off-policy data.
    A Theory of Link Prediction via Relational Weisfeiler-Leman on Knowledge Graphs. (arXiv:2302.02209v4 [cs.LG] UPDATED)
    Graph neural networks are prominent models for representation learning over graph-structured data. While the capabilities and limitations of these models are well-understood for simple graphs, our understanding remains incomplete in the context of knowledge graphs. Our goal is to provide a systematic understanding of the landscape of graph neural networks for knowledge graphs pertaining to the prominent task of link prediction. Our analysis entails a unifying perspective on seemingly unrelated models and unlocks a series of other models. The expressive power of various models is characterized via a corresponding relational Weisfeiler-Leman algorithm. This analysis is extended to provide a precise logical characterization of the class of functions captured by a class of graph neural networks. The theoretical findings presented in this paper explain the benefits of some widely employed practical design choices, which are validated empirically.
    Exploring the Trie of Rules: a fast data structure for the representation of association rules. (arXiv:2310.17355v1 [cs.LG])
    Association rule mining techniques can generate a large volume of sequential data when implemented on transactional databases. Extracting insights from a large set of association rules has been found to be a challenging process. When examining a ruleset, the fundamental question is how to summarise and represent meaningful mined knowledge efficiently. Many algorithms and strategies have been developed to address issue of knowledge extraction; however, the effectiveness of this process can be limited by the data structures. A better data structure can sufficiently affect the speed of the knowledge extraction process. This paper proposes a novel data structure, called the Trie of rules, for storing a ruleset that is generated by association rule mining. The resulting data structure is a prefix-tree graph structure made of pre-mined rules. This graph stores the rules as paths within the prefix-tree in a way that similar rules overlay each other. Each node in the tree represents a rule where a consequent is this node, and an antecedent is a path from this node to the root of the tree. The evaluation showed that the proposed representation technique is promising. It compresses a ruleset with almost no data loss and benefits in terms of time for basic operations such as searching for a specific rule and sorting, which is the base for many knowledge discovery methods. Moreover, our method demonstrated a significant improvement in traversing time, achieving an 8-fold increase compared to traditional data structures.
    Universal Test-time Adaptation through Weight Ensembling, Diversity Weighting, and Prior Correction. (arXiv:2306.00650v2 [cs.CV] UPDATED)
    Since distribution shifts are likely to occur during test-time and can drastically decrease the model's performance, online test-time adaptation (TTA) continues to update the model after deployment, leveraging the current test data. Clearly, a method proposed for online TTA has to perform well for all kinds of environmental conditions. By introducing the variable factors domain non-stationarity and temporal correlation, we first unfold all practically relevant settings and define the entity as universal TTA. We want to highlight that this is the first work that covers such a broad spectrum, which is indispensable for the use in practice. To tackle the problem of universal TTA, we identify and highlight several challenges a self-training based method has to deal with: 1) model bias and the occurrence of trivial solutions when performing entropy minimization on varying sequence lengths with and without multiple domain shifts, 2) loss of generalization which exacerbates the adaptation to multiple domain shifts and the occurrence of catastrophic forgetting, and 3) performance degradation due to shifts in class prior. To prevent the model from becoming biased, we leverage a dataset and model-agnostic certainty and diversity weighting. In order to maintain generalization and prevent catastrophic forgetting, we propose to continually weight-average the source and adapted model. To compensate for disparities in the class prior during test-time, we propose an adaptive prior correction scheme that reweights the model's predictions. We evaluate our approach, named ROID, on a wide range of settings, datasets, and models, setting new standards in the field of universal TTA. Code is available at: https://github.com/mariodoebler/test-time-adaptation
    STEER: Semantic Turn Extension-Expansion Recognition for Voice Assistants. (arXiv:2310.16990v1 [cs.CL])
    In the context of a voice assistant system, steering refers to the phenomenon in which a user issues a follow-up command attempting to direct or clarify a previous turn. We propose STEER, a steering detection model that predicts whether a follow-up turn is a user's attempt to steer the previous command. Constructing a training dataset for steering use cases poses challenges due to the cold-start problem. To overcome this, we developed heuristic rules to sample opt-in usage data, approximating positive and negative samples without any annotation. Our experimental results show promising performance in identifying steering intent, with over 95% accuracy on our sampled data. Moreover, STEER, in conjunction with our sampling strategy, aligns effectively with real-world steering scenarios, as evidenced by its strong zero-shot performance on a human-graded evaluation set. In addition to relying solely on user transcripts as input, we introduce STEER+, an enhanced version of the model. STEER+ utilizes a semantic parse tree to provide more context on out-of-vocabulary words, such as named entities that often occur at the sentence boundary. This further improves model performance, reducing error rate in domains where entities frequently appear, such as messaging. Lastly, we present a data analysis that highlights the improvement in user experience when voice assistants support steering use cases.
    A Deep Learning Approach to Teeth Segmentation and Orientation from Panoramic X-rays. (arXiv:2310.17176v1 [cs.CV])
    Accurate teeth segmentation and orientation are fundamental in modern oral healthcare, enabling precise diagnosis, treatment planning, and dental implant design. In this study, we present a comprehensive approach to teeth segmentation and orientation from panoramic X-ray images, leveraging deep learning techniques. We build our model based on FUSegNet, a popular model originally developed for wound segmentation, and introduce modifications by incorporating grid-based attention gates into the skip connections. We introduce oriented bounding box (OBB) generation through principal component analysis (PCA) for precise tooth orientation estimation. Evaluating our approach on the publicly available DNS dataset, comprising 543 panoramic X-ray images, we achieve the highest Intersection-over-Union (IoU) score of 82.43% and Dice Similarity Coefficient (DSC) score of 90.37% among compared models in teeth instance segmentation. In OBB analysis, we obtain the Rotated IoU (RIoU) score of 82.82%. We also conduct detailed analyses of individual tooth labels and categorical performance, shedding light on strengths and weaknesses. The proposed model's accuracy and versatility offer promising prospects for improving dental diagnoses, treatment planning, and personalized healthcare in the oral domain. Our generated OBB coordinates and codes are available at https://github.com/mrinal054/Instance_teeth_segmentation.
    Emergent representations in networks trained with the Forward-Forward algorithm. (arXiv:2305.18353v2 [cs.NE] UPDATED)
    The Backpropagation algorithm has often been criticised for its lack of biological realism. In an attempt to find a more biologically plausible alternative, the recently introduced Forward-Forward algorithm replaces the forward and backward passes of Backpropagation with two forward passes. In this work, we show that the internal representations obtained by the Forward-Forward algorithm can organise into category-specific ensembles exhibiting high sparsity - i.e. composed of an extremely low number of active units. This situation is reminiscent of what has been observed in cortical sensory areas, where neuronal ensembles are suggested to serve as the functional building blocks for perception and action. Interestingly, while this sparse pattern does not typically arise in models trained with standard Backpropagation, it can emerge in networks trained with Backpropagation on the same objective proposed for the Forward-Forward algorithm. These results suggest that the learning procedure proposed by Forward-Forward may be superior to Backpropagation in modelling learning in the cortex, even when a backward pass is used.
    Unconstrained Dynamic Regret via Sparse Coding. (arXiv:2301.13349v5 [cs.LG] UPDATED)
    Motivated by the challenge of nonstationarity in sequential decision making, we study Online Convex Optimization (OCO) under the coupling of two problem structures: the domain is unbounded, and the comparator sequence $u_1,\ldots,u_T$ is arbitrarily time-varying. As no algorithm can guarantee low regret simultaneously against all comparator sequences, handling this setting requires moving from minimax optimality to comparator adaptivity. That is, sensible regret bounds should depend on certain complexity measures of the comparator relative to one's prior knowledge. This paper achieves a new type of these adaptive regret bounds via a sparse coding framework. The complexity of the comparator is measured by its energy and its sparsity on a user-specified dictionary, which offers considerable versatility. Equipped with a wavelet dictionary for example, our framework improves the state-of-the-art bound (Jacobsen & Cutkosky, 2022) by adapting to both ($i$) the magnitude of the comparator average $||\bar u||=||\sum_{t=1}^Tu_t/T||$, rather than the maximum $\max_t||u_t||$; and ($ii$) the comparator variability $\sum_{t=1}^T||u_t-\bar u||$, rather than the uncentered sum $\sum_{t=1}^T||u_t||$. Furthermore, our analysis is simpler due to decoupling function approximation from regret minimization.
    An Optimal and Scalable Matrix Mechanism for Noisy Marginals under Convex Loss Functions. (arXiv:2305.08175v2 [cs.DB] UPDATED)
    Noisy marginals are a common form of confidentiality-protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner, a matrix mechanism for marginals with Gaussian noise that is both optimal and scalable. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets).
    Efficient Diffusion Policies for Offline Reinforcement Learning. (arXiv:2305.20081v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffsuion-QL significantly boosts the performance of offline RL by representing a policy with a diffusion model, whose success relies on a parametrized Markov Chain with hundreds of steps for sampling. However, Diffusion-QL suffers from two critical limitations. 1) It is computationally inefficient to forward and backward through the whole Markov chain during training. 2) It is incompatible with maximum likelihood-based RL algorithms (e.g., policy gradient methods) as the likelihood of diffusion models is intractable. Therefore, we propose efficient diffusion policy (EDP) to overcome these two challenges. EDP approximately constructs actions from corrupted ones at training to avoid running the sampling chain. We conduct extensive experiments on the D4RL benchmark. The results show that EDP can reduce the diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art on D4RL by large margins over previous methods. Our code is available at https://github.com/sail-sg/edp.
    Multitask Online Learning: Listen to the Neighborhood Buzz. (arXiv:2310.17385v1 [cs.LG])
    We study multitask online learning in a setting where agents can only exchange information with their neighbors on an arbitrary communication network. We introduce $\texttt{MT-CO}_2\texttt{OL}$, a decentralized algorithm for this setting whose regret depends on the interplay between the task similarities and the network structure. Our analysis shows that the regret of $\texttt{MT-CO}_2\texttt{OL}$ is never worse (up to constants) than the bound obtained when agents do not share information. On the other hand, our bounds significantly improve when neighboring agents operate on similar tasks. In addition, we prove that our algorithm can be made differentially private with a negligible impact on the regret when the losses are linear. Finally, we provide experimental support for our theory.
    Sketch-and-Project Meets Newton Method: Global $\mathcal O(k^{-2})$ Convergence with Low-Rank Updates. (arXiv:2305.13082v2 [math.OC] UPDATED)
    In this paper, we propose the first sketch-and-project Newton method with fast $\mathcal O(k^{-2})$ global convergence rate for self-concordant functions. Our method, SGN, can be viewed in three ways: i) as a sketch-and-project algorithm projecting updates of Newton method, ii) as a cubically regularized Newton ethod in sketched subspaces, and iii) as a damped Newton method in sketched subspaces. SGN inherits best of all three worlds: cheap iteration costs of sketch-and-project methods, state-of-the-art $\mathcal O(k^{-2})$ global convergence rate of full-rank Newton-like methods and the algorithm simplicity of damped Newton methods. Finally, we demonstrate its comparable empirical performance to baseline algorithms.
    Curvature Filtrations for Graph Generative Model Evaluation. (arXiv:2301.12906v3 [cs.LG] UPDATED)
    Graph generative model evaluation necessitates understanding differences between graphs on the distributional level. This entails being able to harness salient attributes of graphs in an efficient manner. Curvature constitutes one such property that has recently proved its utility in characterising graphs. Its expressive properties, stability, and practical utility in model evaluation remain largely unexplored, however. We combine graph curvature descriptors with emerging methods from topological data analysis to obtain robust, expressive descriptors for evaluating graph generative models.
    Bootstrapping Vision-Language Learning with Decoupled Language Pre-training. (arXiv:2307.07063v3 [cs.CV] UPDATED)
    We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, with a focus on determining the most relevant visual features for corresponding text. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts, which is trained exclusively on linguistic data, bypassing the need for image-text pairings. This strategy subtly bifurcates the end-to-end VL training process into an additional, separate stage. Our experiments reveal that our framework significantly enhances the performance of a robust image-to-text baseline (BLIP-2), and effectively narrows the performance gap between models trained with either 4M or 129M image-text pairs. Importantly, our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task using varied base modules. The code will be made available at https://github.com/yiren-jian/BLIText.
    Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation. (arXiv:2310.17146v1 [cs.LG])
    In applying reinforcement learning (RL) to high-stakes domains, quantitative and qualitative evaluation using observational data can help practitioners understand the generalization performance of new policies. However, this type of off-policy evaluation (OPE) is inherently limited since offline data may not reflect the distribution shifts resulting from the application of new policies. On the other hand, online evaluation by collecting rollouts according to the new policy is often infeasible, as deploying new policies in these domains can be unsafe. In this work, we propose a semi-offline evaluation framework as an intermediate step between offline and online evaluation, where human users provide annotations of unobserved counterfactual trajectories. While tempting to simply augment existing data with such annotations, we show that this naive approach can lead to biased results. Instead, we design a new family of OPE estimators based on importance sampling (IS) and a novel weighting scheme that incorporate counterfactual annotations without introducing additional bias. We analyze the theoretical properties of our approach, showing its potential to reduce both bias and variance compared to standard IS estimators. Our analyses reveal important practical considerations for handling biased, noisy, or missing annotations. In a series of proof-of-concept experiments involving bandits and a healthcare-inspired simulator, we demonstrate that our approach outperforms purely offline IS estimators and is robust to imperfect annotations. Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of RL in high-stakes domains.
    Sequential Memory with Temporal Predictive Coding. (arXiv:2305.11982v2 [q-bio.NC] UPDATED)
    Forming accurate memory of sequential stimuli is a fundamental function of biological agents. However, the computational mechanism underlying sequential memory in the brain remains unclear. Inspired by neuroscience theories and recent successes in applying predictive coding (PC) to \emph{static} memory tasks, in this work we propose a novel PC-based model for \emph{sequential} memory, called \emph{temporal predictive coding} (tPC). We show that our tPC models can memorize and retrieve sequential inputs accurately with a biologically plausible neural implementation. Importantly, our analytical study reveals that tPC can be viewed as a classical Asymmetric Hopfield Network (AHN) with an implicit statistical whitening process, which leads to more stable performance in sequential memory tasks of structured inputs. Moreover, we find that tPC exhibits properties consistent with behavioral observations and theories in neuroscience, thereby strengthening its biological relevance. Our work establishes a possible computational mechanism underlying sequential memory in the brain that can also be theoretically interpreted using existing memory model frameworks.
    Taming Gradient Variance in Federated Learning with Networked Control Variates. (arXiv:2310.17200v1 [cs.LG])
    Federated learning, a decentralized approach to machine learning, faces significant challenges such as extensive communication overheads, slow convergence, and unstable improvements. These challenges primarily stem from the gradient variance due to heterogeneous client data distributions. To address this, we introduce a novel Networked Control Variates (FedNCV) framework for Federated Learning. We adopt the REINFORCE Leave-One-Out (RLOO) as a fundamental control variate unit in the FedNCV framework, implemented at both client and server levels. At the client level, the RLOO control variate is employed to optimize local gradient updates, mitigating the variance introduced by data samples. Once relayed to the server, the RLOO-based estimator further provides an unbiased and low-variance aggregated gradient, leading to robust global updates. This dual-side application is formalized as a linear combination of composite control variates. We provide a mathematical expression capturing this integration of double control variates within FedNCV and present three theoretical results with corresponding proofs. This unique dual structure equips FedNCV to address data heterogeneity and scalability issues, thus potentially paving the way for large-scale applications. Moreover, we tested FedNCV on six diverse datasets under a Dirichlet distribution with {\alpha} = 0.1, and benchmarked its performance against six SOTA methods, demonstrating its superiority.
    On the Identifiability and Interpretability of Gaussian Process Models. (arXiv:2310.17023v1 [stat.ML])
    In this paper, we critically examine the prevalent practice of using additive mixtures of Mat\'ern kernels in single-output Gaussian process (GP) models and explore the properties of multiplicative mixtures of Mat\'ern kernels for multi-output GP models. For the single-output case, we derive a series of theoretical results showing that the smoothness of a mixture of Mat\'ern kernels is determined by the least smooth component and that a GP with such a kernel is effectively equivalent to the least smooth kernel component. Furthermore, we demonstrate that none of the mixing weights or parameters within individual kernel components are identifiable. We then turn our attention to multi-output GP models and analyze the identifiability of the covariance matrix $A$ in the multiplicative kernel $K(x,y) = AK_0(x,y)$, where $K_0$ is a standard single output kernel such as Mat\'ern. We show that $A$ is identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks. Our findings are supported by extensive simulations and real applications for both single- and multi-output settings. This work provides insight into kernel selection and interpretation for GP models, emphasizing the importance of choosing appropriate kernel structures for different tasks.
    Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting. (arXiv:2310.17032v1 [quant-ph])
    Accurately forecasting solar power generation is crucial in the global progression towards sustainable energy systems. In this study, we conduct a meticulous comparison between Quantum Long Short-Term Memory (QLSTM) and classical Long Short-Term Memory (LSTM) models for solar power production forecasting. Our controlled experiments reveal promising advantages of QLSTMs, including accelerated training convergence and substantially reduced test loss within the initial epoch compared to classical LSTMs. These empirical findings demonstrate QLSTM's potential to swiftly assimilate complex time series relationships, enabled by quantum phenomena like superposition. However, realizing QLSTM's full capabilities necessitates further research into model validation across diverse conditions, systematic hyperparameter optimization, hardware noise resilience, and applications to correlated renewable forecasting problems. With continued progress, quantum machine learning can offer a paradigm shift in renewable energy time series prediction. This pioneering work provides initial evidence substantiating quantum advantages over classical LSTM, while acknowledging present limitations. Through rigorous benchmarking grounded in real-world data, our study elucidates a promising trajectory for quantum learning in renewable forecasting. Additional research and development can further actualize this potential to achieve unprecedented accuracy and reliability in predicting solar power generation worldwide.
    An Explainable Deep Learning-Based Method For Schizophrenia Diagnosis Using Generative Data-Augmentation. (arXiv:2310.16867v1 [cs.LG])
    In this study, we leverage a deep learning-based method for the automatic diagnosis of schizophrenia using EEG brain recordings. This approach utilizes generative data augmentation, a powerful technique that enhances the accuracy of the diagnosis. To enable the utilization of time-frequency features, spectrograms were extracted from the raw signals. After exploring several neural network architectural setups, a proper convolutional neural network (CNN) was used for the initial diagnosis. Subsequently, using Wasserstein GAN with Gradient Penalty (WGAN-GP) and Variational Autoencoder (VAE), two different synthetic datasets were generated in order to augment the initial dataset and address the over-fitting issue. The augmented dataset using VAE achieved a 3.0\% improvement in accuracy reaching up to 99.0\% and yielded a lower loss value as well as a faster convergence. Finally, we addressed the lack of trust in black-box models using the Local Interpretable Model-agnostic Explanations (LIME) algorithm to determine the most important superpixels (frequencies) in the diagnosis process.
    The Significance of Machine Learning in Clinical Disease Diagnosis: A Review. (arXiv:2310.16978v1 [cs.LG])
    The global need for effective disease diagnosis remains substantial, given the complexities of various disease mechanisms and diverse patient symptoms. To tackle these challenges, researchers, physicians, and patients are turning to machine learning (ML), an artificial intelligence (AI) discipline, to develop solutions. By leveraging sophisticated ML and AI methods, healthcare stakeholders gain enhanced diagnostic and treatment capabilities. However, there is a scarcity of research focused on ML algorithms for enhancing the accuracy and computational efficiency. This research investigates the capacity of machine learning algorithms to improve the transmission of heart rate data in time series healthcare metrics, concentrating particularly on optimizing accuracy and efficiency. By exploring various ML algorithms used in healthcare applications, the review presents the latest trends and approaches in ML-based disease diagnosis (MLBDD). The factors under consideration include the algorithm utilized, the types of diseases targeted, the data types employed, the applications, and the evaluation metrics. This review aims to shed light on the prospects of ML in healthcare, particularly in disease diagnosis. By analyzing the current literature, the study provides insights into state-of-the-art methodologies and their performance metrics.
    Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control. (arXiv:2309.14597v2 [cs.LG] UPDATED)
    Deep reinforcement learning agents for continuous control are known to exhibit significant instability in their performance over time. In this work, we provide a fresh perspective on these behaviors by studying the return landscape: the mapping between a policy and a return. We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns. By taking a distributional view of these returns, we map the landscape, characterizing failure-prone regions of policy space and revealing a hidden dimension of policy quality. We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy. To conclude, we develop a distribution-aware procedure which finds such paths, navigating away from noisy neighborhoods in order to improve the robustness of a policy. Taken together, our results provide new insight into the optimization, evaluation, and design of agents.
    Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence Testing. (arXiv:2301.12930v3 [cs.LG] UPDATED)
    A fundamental problem of causal discovery is cause-effect inference, learning the correct causal direction between two random variables. Significant progress has been made through modelling the effect as a function of its cause and a noise term, which allows us to leverage assumptions about the generating function class. The recently introduced heteroscedastic location-scale noise functional models (LSNMs) combine expressive power with identifiability guarantees. LSNM model selection based on maximizing likelihood achieves state-of-the-art accuracy, when the noise distributions are correctly specified. However, through an extensive empirical evaluation, we demonstrate that the accuracy deteriorates sharply when the form of the noise distribution is misspecified by the user. Our analysis shows that the failure occurs mainly when the conditional variance in the anti-causal direction is smaller than that in the causal direction. As an alternative, we find that causal model selection through residual independence testing is much more robust to noise misspecification and misleading conditional variance.
    Local Advantage Networks for Cooperative Multi-Agent Reinforcement Learning. (arXiv:2112.12458v3 [cs.LG] UPDATED)
    Many recent successful off-policy multi-agent reinforcement learning (MARL) algorithms for cooperative partially observable environments focus on finding factorized value functions, leading to convoluted network structures. Building on the structure of independent Q-learners, our LAN algorithm takes a radically different approach, leveraging a dueling architecture to learn for each agent a decentralized best-response policies via individual advantage functions. The learning is stabilized by a centralized critic whose primary objective is to reduce the moving target problem of the individual advantages. The critic, whose network's size is independent of the number of agents, is cast aside after learning. Evaluation on the StarCraft II multi-agent challenge benchmark shows that LAN reaches state-of-the-art performance and is highly scalable with respect to the number of agents, opening up a promising alternative direction for MARL research.
    Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits. (arXiv:2306.07923v2 [cs.LG] UPDATED)
    We consider offline policy optimization (OPO) in contextual bandits, where one is given a fixed dataset of logged interactions. While pessimistic regularizers are typically used to mitigate distribution shift, prior implementations thereof are either specialized or computationally inefficient. We present the first general oracle-efficient algorithm for pessimistic OPO: it reduces to supervised learning, leading to broad applicability. We obtain statistical guarantees analogous to those for prior pessimistic approaches. We instantiate our approach for both discrete and continuous actions and perform experiments in both settings, showing advantage over unregularized OPO across a wide range of configurations.
    BEVContrast: Self-Supervision in BEV Space for Automotive Lidar Point Clouds. (arXiv:2310.17281v1 [cs.CV])
    We present a surprisingly simple and efficient method for self-supervision of 3D backbone on automotive Lidar point clouds. We design a contrastive loss between features of Lidar scans captured in the same scene. Several such approaches have been proposed in the literature from PointConstrast, which uses a contrast at the level of points, to the state-of-the-art TARL, which uses a contrast at the level of segments, roughly corresponding to objects. While the former enjoys a great simplicity of implementation, it is surpassed by the latter, which however requires a costly pre-processing. In BEVContrast, we define our contrast at the level of 2D cells in the Bird's Eye View plane. Resulting cell-level representations offer a good trade-off between the point-level representations exploited in PointContrast and segment-level representations exploited in TARL: we retain the simplicity of PointContrast (cell representations are cheap to compute) while surpassing the performance of TARL in downstream semantic segmentation.
    Transformer-based Atmospheric Density Forecasting. (arXiv:2310.16912v1 [physics.ao-ph])
    As the peak of the solar cycle approaches in 2025 and the ability of a single geomagnetic storm to significantly alter the orbit of Resident Space Objects (RSOs), techniques for atmospheric density forecasting are vital for space situational awareness. While linear data-driven methods, such as dynamic mode decomposition with control (DMDc), have been used previously for forecasting atmospheric density, deep learning-based forecasting has the ability to capture nonlinearities in data. By learning multiple layer weights from historical atmospheric density data, long-term dependencies in the dataset are captured in the mapping between the current atmospheric density state and control input to the atmospheric density state at the next timestep. This work improves upon previous linear propagation methods for atmospheric density forecasting, by developing a nonlinear transformer-based architecture for atmospheric density forecasting. Empirical NRLMSISE-00 and JB2008, as well as physics-based TIEGCM atmospheric density models are compared for forecasting with DMDc and with the transformer-based propagator.  ( 2 min )
    Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks. (arXiv:2310.16955v1 [cs.LG])
    Real-world natural language processing systems need to be robust to human adversaries. Collecting examples of human adversaries for training is an effective but expensive solution. On the other hand, training on synthetic attacks with small perturbations - such as word-substitution - does not actually improve robustness to human adversaries. In this paper, we propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets - both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. In ANLI, we see accuracy gains on the current set of attacks (44.1%$\,\to\,$50.1%) and on two future unseen rounds of human generated attacks (32.5%$\,\to\,$43.4%, and 29.4%$\,\to\,$40.2%). In hate speech detection, we see AUC gains on current attacks (0.76 $\to$ 0.84) and a future round (0.77 $\to$ 0.79). Attacks from methods that do not learn the distribution of existing human adversaries, meanwhile, degrade robustness.  ( 2 min )
    Probabilistic Integral Circuits. (arXiv:2310.16986v1 [cs.LG])
    Continuous latent variables (LVs) are a key ingredient of many generative models, as they allow modelling expressive mixtures with an uncountable number of components. In contrast, probabilistic circuits (PCs) are hierarchical discrete mixtures represented as computational graphs composed of input, sum and product units. Unlike continuous LV models, PCs provide tractable inference but are limited to discrete LVs with categorical (i.e. unordered) states. We bridge these model classes by introducing probabilistic integral circuits (PICs), a new language of computational graphs that extends PCs with integral units representing continuous LVs. In the first place, PICs are symbolic computational graphs and are fully tractable in simple cases where analytical integration is possible. In practice, we parameterise PICs with light-weight neural nets delivering an intractable hierarchical continuous mixture that can be approximated arbitrarily well with large PCs using numerical quadrature. On several distribution estimation benchmarks, we show that such PIC-approximating PCs systematically outperform PCs commonly learned via expectation-maximization or SGD.  ( 2 min )
    Improvement in Alzheimer's Disease MRI Images Analysis by Convolutional Neural Networks Via Topological Optimization. (arXiv:2310.16857v1 [eess.IV])
    This research underscores the efficacy of Fourier topological optimization in refining MRI imagery, thereby bolstering the classification precision of Alzheimer's Disease through convolutional neural networks. Recognizing that MRI scans are indispensable for neurological assessments, but frequently grapple with issues like blurriness and contrast irregularities, the deployment of Fourier topological optimization offered enhanced delineation of brain structures, ameliorated noise, and superior contrast. The applied techniques prioritized boundary enhancement, contrast and brightness adjustments, and overall image lucidity. Employing CNN architectures VGG16, ResNet50, InceptionV3, and Xception, the post-optimization analysis revealed a marked elevation in performance. Conclusively, the amalgamation of Fourier topological optimization with CNNs delineates a promising trajectory for the nuanced classification of Alzheimer's Disease, portending a transformative impact on its diagnostic paradigms.  ( 2 min )
    Squared Neural Families: A New Class of Tractable Density Models. (arXiv:2305.13552v2 [cs.LG] UPDATED)
    Flexible models for probability distributions are an essential ingredient in many machine learning tasks. We develop and investigate a new class of probability distributions, which we call a Squared Neural Family (SNEFY), formed by squaring the 2-norm of a neural network and normalising it with respect to a base measure. Following the reasoning similar to the well established connections between infinitely wide neural networks and Gaussian processes, we show that SNEFYs admit closed form normalising constants in many cases of interest, thereby resulting in flexible yet fully tractable density models. SNEFYs strictly generalise classical exponential families, are closed under conditioning, and have tractable marginal distributions. Their utility is illustrated on a variety of density estimation, conditional density estimation, and density estimation with missing data tasks.
    MimicTouch: Learning Human's Control Strategy with Multi-Modal Tactile Feedback. (arXiv:2310.16917v1 [cs.RO])
    In robotics and artificial intelligence, the integration of tactile processing is becoming increasingly pivotal, especially in learning to execute intricate tasks like alignment and insertion. However, existing works focusing on tactile methods for insertion tasks predominantly rely on robot teleoperation data and reinforcement learning, which do not utilize the rich insights provided by human's control strategy guided by tactile feedback. For utilizing human sensations, methodologies related to learning from humans predominantly leverage visual feedback, often overlooking the invaluable tactile feedback that humans inherently employ to finish complex manipulations. Addressing this gap, we introduce "MimicTouch", a novel framework that mimics human's tactile-guided control strategy. In this framework, we initially collect multi-modal tactile datasets from human demonstrators, incorporating human tactile-guided control strategies for task completion. The subsequent step involves instructing robots through imitation learning using multi-modal sensor data and retargeted human motions. To further mitigate the embodiment gap between humans and robots, we employ online residual reinforcement learning on the physical robot. Through comprehensive experiments, we validate the safety of MimicTouch in transferring a latent policy learned through imitation learning from human to robot. This ongoing work will pave the way for a broader spectrum of tactile-guided robotic applications.  ( 2 min )
    Zephyr: Direct Distillation of LM Alignment. (arXiv:2310.16944v1 [cs.LG])
    We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at https://github.com/huggingface/alignment-handbook.  ( 2 min )
    Improving Few-shot Generalization of Safety Classifiers via Data Augmented Parameter-Efficient Fine-Tuning. (arXiv:2310.16959v1 [cs.LG])
    As large language models (LLMs) are widely adopted, new safety issues and policies emerge, to which existing safety classifiers do not generalize well. If we have only observed a few examples of violations of a new safety rule, how can we build a classifier to detect violations? In this paper, we study the novel setting of domain-generalized few-shot learning for LLM-based text safety classifiers. Unlike prior few-shot work, these new safety issues can be hard to uncover and we do not get to choose the few examples. We demonstrate that existing few-shot techniques do not perform well in this setting, and rather we propose to do parameter-efficient fine-tuning (PEFT) combined with augmenting training data based on similar examples in prior existing rules. We empirically show that our approach of similarity-based data-augmentation + prompt-tuning (DAPT) consistently outperforms baselines that either do not rely on data augmentation or on PEFT by 7-17% F1 score in the Social Chemistry moral judgement and 9-13% AUC in the Toxicity detection tasks, even when the new rule is loosely correlated with existing ones.  ( 2 min )
    Causal Q-Aggregation for CATE Model Selection. (arXiv:2310.16945v1 [stat.ML])
    Accurate estimation of conditional average treatment effects (CATE) is at the core of personalized decision making. While there is a plethora of models for CATE estimation, model selection is a nontrivial task, due to the fundamental problem of causal inference. Recent empirical work provides evidence in favor of proxy loss metrics with double robust properties and in favor of model ensembling. However, theoretical understanding is lacking. Direct application of prior theoretical work leads to suboptimal oracle model selection rates due to the non-convexity of the model selection problem. We provide regret rates for the major existing CATE ensembling approaches and propose a new CATE model ensembling approach based on Q-aggregation using the doubly robust loss. Our main result shows that causal Q-aggregation achieves statistically optimal oracle model selection regret rates of $\frac{\log(M)}{n}$ (with $M$ models and $n$ samples), with the addition of higher-order estimation error terms related to products of errors in the nuisance functions. Crucially, our regret rate does not require that any of the candidate CATE models be close to the truth. We validate our new method on many semi-synthetic datasets and also provide extensions of our work to CATE model selection with instrumental variables and unobserved confounding.  ( 2 min )
    Towards Continually Learning Application Performance Models. (arXiv:2310.16996v1 [cs.LG])
    Machine learning-based performance models are increasingly being used to build critical job scheduling and application optimization decisions. Traditionally, these models assume that data distribution does not change as more samples are collected over time. However, owing to the complexity and heterogeneity of production HPC systems, they are susceptible to hardware degradation, replacement, and/or software patches, which can lead to drift in the data distribution that can adversely affect the performance models. To this end, we develop continually learning performance models that account for the distribution drift, alleviate catastrophic forgetting, and improve generalizability. Our best model was able to retain accuracy, regardless of having to learn the new distribution of data inflicted by system changes, while demonstrating a 2x improvement in the prediction accuracy of the whole data sequence in comparison to the naive approach.  ( 2 min )
    Exploring Behavior Discovery Methods for Heterogeneous Swarms of Limited-Capability Robots. (arXiv:2310.16941v1 [cs.RO])
    We study the problem of determining the emergent behaviors that are possible given a functionally heterogeneous swarm of robots with limited capabilities. Prior work has considered behavior search for homogeneous swarms and proposed the use of novelty search over either a hand-specified or learned behavior space followed by clustering to return a taxonomy of emergent behaviors to the user. In this paper, we seek to better understand the role of novelty search and the efficacy of using clustering to discover novel emergent behaviors. Through a large set of experiments and ablations, we analyze the effect of representations, evolutionary search, and various clustering methods in the search for novel behaviors in a heterogeneous swarm. Our results indicate that prior methods fail to discover many interesting behaviors and that an iterative human-in-the-loop discovery process discovers more behaviors than random search, swarm chemistry, and automated behavior discovery. The combined discoveries of our experiments uncover 23 emergent behaviors, 18 of which are novel discoveries. To the best of our knowledge, these are the first known emergent behaviors for heterogeneous swarms of computation-free agents. Videos, code, and appendix are available at the project website: https://sites.google.com/view/heterogeneous-bd-methods  ( 2 min )
    General Point Model with Autoencoding and Autoregressive. (arXiv:2310.16861v1 [cs.LG])
    The pre-training architectures of large language models encompass various types, including autoencoding models, autoregressive models, and encoder-decoder models. We posit that any modality can potentially benefit from a large language model, as long as it undergoes vector quantization to become discrete tokens. Inspired by GLM, we propose a General Point Model (GPM) which seamlessly integrates autoencoding and autoregressive tasks in point cloud transformer. This model is versatile, allowing fine-tuning for downstream point cloud representation tasks, as well as unconditional and conditional generation tasks. GPM enhances masked prediction in autoencoding through various forms of mask padding tasks, leading to improved performance in point cloud understanding. Additionally, GPM demonstrates highly competitive results in unconditional point cloud generation tasks, even exhibiting the potential for conditional generation tasks by modifying the input's conditional information. Compared to models like Point-BERT, MaskPoint and PointMAE, our GPM achieves superior performance in point cloud understanding tasks. Furthermore, the integration of autoregressive and autoencoding within the same transformer underscores its versatility across different downstream tasks.  ( 2 min )
  • Open

    A Neural Collapse Perspective on Feature Evolution in Graph Neural Networks. (arXiv:2307.01951v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have become increasingly popular for classification tasks on graph-structured data. Yet, the interplay between graph topology and feature evolution in GNNs is not well understood. In this paper, we focus on node-wise classification, illustrated with community detection on stochastic block model graphs, and explore the feature evolution through the lens of the "Neural Collapse" (NC) phenomenon. When training instance-wise deep classifiers (e.g. for image classification) beyond the zero training error point, NC demonstrates a reduction in the deepest features' within-class variability and an increased alignment of their class means to certain symmetric structures. We start with an empirical study that shows that a decrease in within-class variability is also prevalent in the node-wise classification setting, however, not to the extent observed in the instance-wise case. Then, we theoretically study this distinction. Specifically, we show that even an "optimistic" mathematical model requires that the graphs obey a strict structural condition in order to possess a minimizer with exact collapse. Interestingly, this condition is viable also for heterophilic graphs and relates to recent empirical studies on settings with improved GNNs' generalization. Furthermore, by studying the gradient dynamics of the theoretical model, we provide reasoning for the partial collapse observed empirically. Finally, we present a study on the evolution of within- and between-class feature variability across layers of a well-trained GNN and contrast the behavior with spectral methods.  ( 3 min )
    Assessing the overall and partial causal well-specification of nonlinear additive noise models. (arXiv:2310.16502v2 [stat.ME] UPDATED)
    We propose a method to detect model misspecifications in nonlinear causal additive and potentially heteroscedastic noise models. We aim to identify predictor variables for which we can infer the causal effect even in cases of such misspecification. We develop a general framework based on knowledge of the multivariate observational data distribution and we then propose an algorithm for finite sample data, discuss its asymptotic properties, and illustrate its performance on simulated and real data.
    Learning Rate Free Bayesian Inference in Constrained Domains. (arXiv:2305.14943v2 [stat.ML] UPDATED)
    We introduce a suite of new particle-based algorithms for sampling on constrained domains which are entirely learning rate free. Our approach leverages coin betting ideas from convex optimisation, and the viewpoint of constrained sampling as a mirrored optimisation problem on the space of probability measures. Based on this viewpoint, we also introduce a unifying framework for several existing constrained sampling algorithms, including mirrored Langevin dynamics and mirrored Stein variational gradient descent. We demonstrate the performance of our algorithms on a range of numerical examples, including sampling from targets on the simplex, sampling with fairness constraints, and constrained sampling problems in post-selection inference. Our results indicate that our algorithms achieve competitive performance with existing constrained sampling methods, without the need to tune any hyperparameters.
    Multi-scale Diffusion Denoised Smoothing. (arXiv:2310.16779v2 [cs.LG] UPDATED)
    Along with recent diffusion models, randomized smoothing has become one of a few tangible approaches that offers adversarial robustness to models at scale, e.g., those of large pre-trained models. Specifically, one can perform randomized smoothing on any classifier via a simple "denoise-and-classify" pipeline, so-called denoised smoothing, given that an accurate denoiser is available - such as diffusion model. In this paper, we present scalable methods to address the current trade-off between certified robustness and accuracy in denoised smoothing. Our key idea is to "selectively" apply smoothing among multiple noise scales, coined multi-scale smoothing, which can be efficiently implemented with a single diffusion model. This approach also suggests a new objective to compare the collective robustness of multi-scale smoothed classifiers, and questions which representation of diffusion model would maximize the objective. To address this, we propose to further fine-tune diffusion model (a) to perform consistent denoising whenever the original image is recoverable, but (b) to generate rather diverse outputs otherwise. Our experiments show that the proposed multi-scale smoothing scheme combined with diffusion fine-tuning enables strong certified robustness available with high noise level while maintaining its accuracy closer to non-smoothed classifiers.  ( 2 min )
    Robust Output Analysis with Monte-Carlo Methodology. (arXiv:2207.13612v3 [stat.ME] CROSS LISTED)
    In predictive modeling with simulation or machine learning, it is critical to accurately assess the quality of estimated values through output analysis. In recent decades output analysis has become enriched with methods that quantify the impact of input data uncertainty in the model outputs to increase robustness. However, most developments are applicable assuming that the input data adheres to a parametric family of distributions. We propose a unified output analysis framework for simulation and machine learning outputs through the lens of Monte Carlo sampling. This framework provides nonparametric quantification of the variance and bias induced in the outputs with higher-order accuracy. Our new bias-corrected estimation from the model outputs leverages the extension of fast iterative bootstrap sampling and higher-order influence functions. For the scalability of the proposed estimation methods, we devise budget-optimal rules and leverage control variates for variance reduction. Our theoretical and numerical results demonstrate a clear advantage in building more robust confidence intervals from the model outputs with higher coverage probability.  ( 2 min )
    A Mean Field Approach to Empirical Bayes Estimation in High-dimensional Linear Regression. (arXiv:2309.16843v2 [math.ST] UPDATED)
    We study empirical Bayes estimation in high-dimensional linear regression. To facilitate computationally efficient estimation of the underlying prior, we adopt a variational empirical Bayes approach, introduced originally in Carbonetto and Stephens (2012) and Kim et al. (2022). We establish asymptotic consistency of the nonparametric maximum likelihood estimator (NPMLE) and its (computable) naive mean field variational surrogate under mild assumptions on the design and the prior. Assuming, in addition, that the naive mean field approximation has a dominant optimizer, we develop a computationally efficient approximation to the oracle posterior distribution, and establish its accuracy under the 1-Wasserstein metric. This enables computationally feasible Bayesian inference; e.g., construction of posterior credible intervals with an average coverage guarantee, Bayes optimal estimation for the regression coefficients, estimation of the proportion of non-nulls, etc. Our analysis covers both deterministic and random designs, and accommodates correlations among the features. To the best of our knowledge, this provides the first rigorous nonparametric empirical Bayes method in a high-dimensional regression setting without sparsity.  ( 2 min )
    Statistically Valid Variable Importance Assessment through Conditional Permutations. (arXiv:2309.07593v2 [cs.LG] UPDATED)
    Variable importance assessment has become a crucial step in machine-learning applications when using complex learners, such as deep neural networks, on large-scale data. Removal-based importance assessment is currently the reference approach, particularly when statistical guarantees are sought to justify variable inclusion. It is often implemented with variable permutation schemes. On the flip side, these approaches risk misidentifying unimportant variables as important in the presence of correlations among covariates. Here we develop a systematic approach for studying Conditional Permutation Importance (CPI) that is model agnostic and computationally lean, as well as reusable benchmarks of state-of-the-art variable importance estimators. We show theoretically and empirically that $\textit{CPI}$ overcomes the limitations of standard permutation importance by providing accurate type-I error control. When used with a deep neural network, $\textit{CPI}$ consistently showed top accuracy across benchmarks. An experiment on real-world data analysis in a large-scale medical dataset showed that $\textit{CPI}$ provides a more parsimonious selection of statistically significant variables. Our results suggest that $\textit{CPI}$ can be readily used as drop-in replacement for permutation-based methods.  ( 3 min )
    Unconstrained Dynamic Regret via Sparse Coding. (arXiv:2301.13349v5 [cs.LG] UPDATED)
    Motivated by the challenge of nonstationarity in sequential decision making, we study Online Convex Optimization (OCO) under the coupling of two problem structures: the domain is unbounded, and the comparator sequence $u_1,\ldots,u_T$ is arbitrarily time-varying. As no algorithm can guarantee low regret simultaneously against all comparator sequences, handling this setting requires moving from minimax optimality to comparator adaptivity. That is, sensible regret bounds should depend on certain complexity measures of the comparator relative to one's prior knowledge. This paper achieves a new type of these adaptive regret bounds via a sparse coding framework. The complexity of the comparator is measured by its energy and its sparsity on a user-specified dictionary, which offers considerable versatility. Equipped with a wavelet dictionary for example, our framework improves the state-of-the-art bound (Jacobsen & Cutkosky, 2022) by adapting to both ($i$) the magnitude of the comparator average $||\bar u||=||\sum_{t=1}^Tu_t/T||$, rather than the maximum $\max_t||u_t||$; and ($ii$) the comparator variability $\sum_{t=1}^T||u_t-\bar u||$, rather than the uncentered sum $\sum_{t=1}^T||u_t||$. Furthermore, our analysis is simpler due to decoupling function approximation from regret minimization.  ( 3 min )
    Robust Covariate Shift Adaptation for Density-Ratio Estimation. (arXiv:2310.16638v2 [stat.ME] UPDATED)
    Consider a scenario where we have access to train data with both covariates and outcomes while test data only contains covariates. In this scenario, our primary aim is to predict the missing outcomes of the test data. With this objective in mind, we train parametric regression models under a covariate shift, where covariate distributions are different between the train and test data. For this problem, existing studies have proposed covariate shift adaptation via importance weighting using the density ratio. This approach averages the train data losses, each weighted by an estimated ratio of the covariate densities between the train and test data, to approximate the test-data risk. Although it allows us to obtain a test-data risk minimizer, its performance heavily relies on the accuracy of the density ratio estimation. Moreover, even if the density ratio can be consistently estimated, the estimation errors of the density ratio also yield bias in the estimators of the regression model's parameters of interest. To mitigate these challenges, we introduce a doubly robust estimator for covariate shift adaptation via importance weighting, which incorporates an additional estimator for the regression function. Leveraging double machine learning techniques, our estimator reduces the bias arising from the density ratio estimation errors. We demonstrate the asymptotic distribution of the regression parameter estimator. Notably, our estimator remains consistent if either the density ratio estimator or the regression function is consistent, showcasing its robustness against potential errors in density ratio estimation. Finally, we confirm the soundness of our proposed method via simulation studies.
    Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence Testing. (arXiv:2301.12930v3 [cs.LG] UPDATED)
    A fundamental problem of causal discovery is cause-effect inference, learning the correct causal direction between two random variables. Significant progress has been made through modelling the effect as a function of its cause and a noise term, which allows us to leverage assumptions about the generating function class. The recently introduced heteroscedastic location-scale noise functional models (LSNMs) combine expressive power with identifiability guarantees. LSNM model selection based on maximizing likelihood achieves state-of-the-art accuracy, when the noise distributions are correctly specified. However, through an extensive empirical evaluation, we demonstrate that the accuracy deteriorates sharply when the form of the noise distribution is misspecified by the user. Our analysis shows that the failure occurs mainly when the conditional variance in the anti-causal direction is smaller than that in the causal direction. As an alternative, we find that causal model selection through residual independence testing is much more robust to noise misspecification and misleading conditional variance.  ( 2 min )
    Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory. (arXiv:2307.04204v2 [cs.LG] UPDATED)
    Cohen et al. (2021) empirically study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent (GD) trajectory and observe the Edge of Stability (EoS) phenomenon. The sharpness increases at the early phase of training (referred to as progressive sharpening), and eventually saturates close to the threshold of $2 / \text{(step size)}$. In this paper, we start by demonstrating through empirical studies that when the EoS phenomenon occurs, different GD trajectories (after a proper reparameterization) align on a specific bifurcation diagram independent of initialization. We then rigorously prove this trajectory alignment phenomenon for a two-layer fully-connected linear network and a single-neuron nonlinear network trained with a single data point. Our trajectory alignment analysis establishes both progressive sharpening and EoS phenomena, encompassing and extending recent findings in the literature.  ( 2 min )
    Statistical Component Separation for Targeted Signal Recovery in Noisy Mixtures. (arXiv:2306.15012v2 [stat.ML] UPDATED)
    Separating signals from an additive mixture may be an unnecessarily hard problem when one is only interested in specific properties of a given signal. In this work, we tackle simpler "statistical component separation" problems that focus on recovering a predefined set of statistical descriptors of a target signal from a noisy mixture. Assuming access to samples of the noise process, we investigate a method devised to match the statistics of the solution candidate corrupted by noise samples with those of the observed mixture. We first analyze the behavior of this method using simple examples with analytically tractable calculations. Then, we apply it in an image denoising context employing 1) wavelet-based descriptors, 2) ConvNet-based descriptors on astrophysics and ImageNet data. In the case of 1), we show that our method better recovers the descriptors of the target data than a standard denoising method in most situations. Additionally, despite not constructed for this purpose, it performs surprisingly well in terms of peak signal-to-noise ratio on full signal reconstruction. In comparison, representation 2) appears less suitable for image denoising. Finally, we extend this method by introducing a diffusive stepwise algorithm which gives a new perspective to the initial method and leads to promising results for image denoising under specific circumstances.  ( 3 min )
    Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension. (arXiv:2305.14077v2 [stat.ML] UPDATED)
    The success of over-parameterized neural networks trained to near-zero training error has caused great interest in the phenomenon of benign overfitting, where estimators are statistically consistent even though they interpolate noisy training data. While benign overfitting in fixed dimension has been established for some learning methods, current literature suggests that for regression with typical kernel methods and wide neural networks, benign overfitting requires a high-dimensional setting where the dimension grows with the sample size. In this paper, we show that the smoothness of the estimators, and not the dimension, is the key: benign overfitting is possible if and only if the estimator's derivatives are large enough. We generalize existing inconsistency results to non-interpolating models and more kernels to show that benign overfitting with moderate derivatives is impossible in fixed dimension. Conversely, we show that rate-optimal benign overfitting is possible for regression with a sequence of spiky-smooth kernels with large derivatives. Using neural tangent kernels, we translate our results to wide neural networks. We prove that while infinite-width networks do not overfit benignly with the ReLU activation, this can be fixed by adding small high-frequency fluctuations to the activation function. Our experiments verify that such neural networks, while overfitting, can indeed generalize well even on low-dimensional data sets.  ( 3 min )
    Label Embedding via Low-Coherence Matrices. (arXiv:2305.19470v3 [cs.LG] UPDATED)
    Label embedding is a framework for multiclass classification problems where each label is represented by a distinct vector of some fixed dimension, and training involves matching model output to the vector representing the correct label. While label embedding has been successfully applied in extreme classification and zero-shot learning, and offers both computational and statistical advantages, its theoretical foundations remain poorly understood. This work presents an analysis of label embedding in the context of extreme multiclass classification, where the number of classes $C$ is very large. We present an excess risk bound that reveals a trade-off between computational and statistical efficiency, quantified via the coherence of the embedding matrix. We further show that under the Massart noise condition, the statistical penalty for label embedding vanishes with sufficiently low coherence. Our analysis supports an algorithm that is simple, scalable, and easily parallelizable, and experimental results demonstrate its effectiveness in large-scale applications.  ( 2 min )
    Distribution-Free Model-Agnostic Regression Calibration via Nonparametric Methods. (arXiv:2305.12283v2 [cs.LG] UPDATED)
    In this paper, we consider the uncertainty quantification problem for regression models. Specifically, we consider an individual calibration objective for characterizing the quantiles of the prediction model. While such an objective is well-motivated from downstream tasks such as newsvendor cost, the existing methods have been largely heuristic and lack of statistical guarantee in terms of individual calibration. We show via simple examples that the existing methods focusing on population-level calibration guarantees such as average calibration or sharpness can lead to harmful and unexpected results. We propose simple nonparametric calibration methods that are agnostic of the underlying prediction model and enjoy both computational efficiency and statistical consistency. Our approach enables a better understanding of the possibility of individual calibration, and we establish matching upper and lower bounds for the calibration error of our proposed methods. Technically, our analysis combines the nonparametric analysis with a covering number argument for parametric analysis, which advances the existing theoretical analyses in the literature of nonparametric density estimation and quantile bandit problems. Importantly, the nonparametric perspective sheds new theoretical insights into regression calibration in terms of the curse of dimensionality and reconciles the existing results on the impossibility of individual calibration. To our knowledge, we make the first effort to reach both individual calibration and finite-sample guarantee with minimal assumptions in terms of conformal prediction. Numerical experiments show the advantage of such a simple approach under various metrics, and also under covariates shift. We hope our work provides a simple benchmark and a starting point of theoretical ground for future research on regression calibration.  ( 3 min )
    No-Regret Online Reinforcement Learning with Adversarial Losses and Transitions. (arXiv:2305.17380v3 [cs.LG] UPDATED)
    Existing online learning algorithms for adversarial Markov Decision Processes achieve ${O}(\sqrt{T})$ regret after $T$ rounds of interactions even if the loss functions are chosen arbitrarily by an adversary, with the caveat that the transition function has to be fixed. This is because it has been shown that adversarial transition functions make no-regret learning impossible. Despite such impossibility results, in this work, we develop algorithms that can handle both adversarial losses and adversarial transitions, with regret increasing smoothly in the degree of maliciousness of the adversary. More concretely, we first propose an algorithm that enjoys $\widetilde{{O}}(\sqrt{T} + C^{\textsf{P}})$ regret where $C^{\textsf{P}}$ measures how adversarial the transition functions are and can be at most ${O}(T)$. While this algorithm itself requires knowledge of $C^{\textsf{P}}$, we further develop a black-box reduction approach that removes this requirement. Moreover, we also show that further refinements of the algorithm not only maintains the same regret bound, but also simultaneously adapts to easier environments (where losses are generated in a certain stochastically constrained manner as in Jin et al. [2021]) and achieves $\widetilde{{O}}(U + \sqrt{UC^{\textsf{L}}} + C^{\textsf{P}})$ regret, where $U$ is some standard gap-dependent coefficient and $C^{\textsf{L}}$ is the amount of corruption on losses.  ( 3 min )
    Improved Best-of-Both-Worlds Guarantees for Multi-Armed Bandits: FTRL with General Regularizers and Multiple Optimal Arms. (arXiv:2302.13534v2 [cs.LG] UPDATED)
    We study the problem of designing adaptive multi-armed bandit algorithms that perform optimally in both the stochastic setting and the adversarial setting simultaneously (often known as a best-of-both-world guarantee). A line of recent works shows that when configured and analyzed properly, the Follow-the-Regularized-Leader (FTRL) algorithm, originally designed for the adversarial setting, can in fact optimally adapt to the stochastic setting as well. Such results, however, critically rely on an assumption that there exists one unique optimal arm. Recently, Ito (2021) took the first step to remove such an undesirable uniqueness assumption for one particular FTRL algorithm with the $\frac{1}{2}$-Tsallis entropy regularizer. In this work, we significantly improve and generalize this result, showing that uniqueness is unnecessary for FTRL with a broad family of regularizers and a new learning rate schedule. For some regularizers, our regret bounds also improve upon prior results even when uniqueness holds. We further provide an application of our results to the decoupled exploration and exploitation problem, demonstrating that our techniques are broadly applicable.  ( 3 min )
    Monte Carlo guided Diffusion for Bayesian linear inverse problems. (arXiv:2308.07983v2 [stat.ML] UPDATED)
    Ill-posed linear inverse problems arise frequently in various applications, from computational photography to medical imaging. A recent line of research exploits Bayesian inference with informative priors to handle the ill-posedness of such problems. Amongst such priors, score-based generative models (SGM) have recently been successfully applied to several different inverse problems. In this study, we exploit the particular structure of the prior defined by the SGM to define a sequence of intermediate linear inverse problems. As the noise level decreases, the posteriors of these inverse problems get closer to the target posterior of the original inverse problem. To sample from this sequence of posteriors, we propose the use of Sequential Monte Carlo (SMC) methods. The proposed algorithm, MCGDiff, is shown to be theoretically grounded and we provide numerical simulations showing that it outperforms competing baselines when dealing with ill-posed inverse problems in a Bayesian setting.  ( 2 min )
    Hierarchical clustering with OWA-based linkages, the Lance-Williams formula, and dendrogram inversions. (arXiv:2303.05683v2 [stat.ML] UPDATED)
    Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationships between the famous Lance-Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide some conditions for the weight generators to guarantee the resulting dendrograms to be free from unaesthetic inversions.  ( 2 min )
    RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers via Randomized Deletion. (arXiv:2302.01757v2 [cs.CR] UPDATED)
    Randomized smoothing is a leading approach for constructing classifiers that are certifiably robust against adversarial examples. Existing work on randomized smoothing has focused on classifiers with continuous inputs, such as images, where $\ell_p$-norm bounded adversaries are commonly studied. However, there has been limited work for classifiers with discrete or variable-size inputs, such as for source code, which require different threat models and smoothing mechanisms. In this work, we adapt randomized smoothing for discrete sequence classifiers to provide certified robustness against edit distance-bounded adversaries. Our proposed smoothing mechanism randomized deletion (RS-Del) applies random deletion edits, which are (perhaps surprisingly) sufficient to confer robustness against adversarial deletion, insertion and substitution edits. Our proof of certification deviates from the established Neyman-Pearson approach, which is intractable in our setting, and is instead organized around longest common subsequences. We present a case study on malware detection--a binary classification problem on byte sequences where classifier evasion is a well-established threat model. When applied to the popular MalConv malware detection model, our smoothing mechanism RS-Del achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.  ( 3 min )
    Mutual information of spin systems from autoregressive neural networks. (arXiv:2304.13412v2 [cond-mat.stat-mech] UPDATED)
    We describe a new direct method to estimate bipartite mutual information of a classical spin system based on Monte Carlo sampling enhanced by autoregressive neural networks. It allows studying arbitrary geometries of subsystems and can be generalized to classical field theories. We demonstrate it on the Ising model for four partitionings, including a multiply-connected even-odd division. We show that the area law is satisfied for temperatures away from the critical temperature: the constant term is universal, whereas the proportionality coefficient is different for the even-odd partitioning.  ( 2 min )
    Gaussian Membership Inference Privacy. (arXiv:2306.07273v2 [cs.LG] UPDATED)
    We propose a novel and practical privacy notion called $f$-Membership Inference Privacy ($f$-MIP), which explicitly considers the capabilities of realistic adversaries under the membership inference attack threat model. Consequently, $f$-MIP offers interpretable privacy guarantees and improved utility (e.g., better classification accuracy). In particular, we derive a parametric family of $f$-MIP guarantees that we refer to as $\mu$-Gaussian Membership Inference Privacy ($\mu$-GMIP) by theoretically analyzing likelihood ratio-based membership inference attacks on stochastic gradient descent (SGD). Our analysis highlights that models trained with standard SGD already offer an elementary level of MIP. Additionally, we show how $f$-MIP can be amplified by adding noise to gradient updates. Our analysis further yields an analytical membership inference attack that offers two distinct advantages over previous approaches. First, unlike existing state-of-the-art attacks that require training hundreds of shadow models, our attack does not require any shadow model. Second, our analytical attack enables straightforward auditing of our privacy notion $f$-MIP. Finally, we quantify how various hyperparameters (e.g., batch size, number of model parameters) and specific data characteristics determine an attacker's ability to accurately infer a point's membership in the training set. We demonstrate the effectiveness of our method on models trained on vision and tabular datasets.  ( 2 min )
    A Batch-to-Online Transformation under Random-Order Model. (arXiv:2306.07163v2 [cs.LG] UPDATED)
    We introduce a transformation framework that can be utilized to develop online algorithms with low $\epsilon$-approximate regret in the random-order model from offline approximation algorithms. We first give a general reduction theorem that transforms an offline approximation algorithm with low average sensitivity to an online algorithm with low $\epsilon$-approximate regret. We then demonstrate that offline approximation algorithms can be transformed into a low-sensitivity version using a coreset construction method. To showcase the versatility of our approach, we apply it to various problems, including online $(k,z)$-clustering, online matrix approximation, and online regression, and successfully achieve polylogarithmic $\epsilon$-approximate regret for each problem. Moreover, we show that in all three cases, our algorithm also enjoys low inconsistency, which may be desired in some online applications.  ( 2 min )
    Instability of computer vision models is a necessary result of the task itself. (arXiv:2310.17559v1 [cs.CV])
    Adversarial examples resulting from instability of current computer vision models are an extremely important topic due to their potential to compromise any application. In this paper we demonstrate that instability is inevitable due to a) symmetries (translational invariance) of the data, b) the categorical nature of the classification task, and c) the fundamental discrepancy of classifying images as objects themselves. The issue is further exacerbated by non-exhaustive labelling of the training data. Therefore we conclude that instability is a necessary result of how the problem of computer vision is currently formulated. While the problem cannot be eliminated, through the analysis of the causes, we have arrived at ways how it can be partially alleviated. These include i) increasing the resolution of images, ii) providing contextual information for the image, iii) exhaustive labelling of training data, and iv) preventing attackers from frequent access to the computer vision system.  ( 2 min )
    Convergence of flow-based generative models via proximal gradient descent in Wasserstein space. (arXiv:2310.17582v1 [stat.ML])
    Flow-based generative models enjoy certain advantages in computing the data generation and the likelihood, and have recently shown competitive empirical performance. Compared to the accumulating theoretical studies on related score-based diffusion models, analysis of flow-based models, which are deterministic in both forward (data-to-noise) and reverse (noise-to-data) directions, remain sparse. In this paper, we provide a theoretical guarantee of generating data distribution by a progressive flow model, the so-called JKO flow model, which implements the Jordan-Kinderleherer-Otto (JKO) scheme in a normalizing flow network. Leveraging the exponential convergence of the proximal gradient descent (GD) in Wasserstein space, we prove the Kullback-Leibler (KL) guarantee of data generation by a JKO flow model to be $O(\varepsilon^2)$ when using $N \lesssim \log (1/\varepsilon)$ many JKO steps ($N$ Residual Blocks in the flow) where $\varepsilon $ is the error in the per-step first-order condition. The assumption on data density is merely a finite second moment, and the theory extends to data distributions without density and when there are inversion errors in the reverse process where we obtain KL-$W_2$ mixed error guarantees. The non-asymptotic convergence rate of the JKO-type $W_2$-proximal GD is proved for a general class of convex objective functionals that includes the KL divergence as a special case, which can be of independent interest.  ( 3 min )
    Bounding Box-based Multi-objective Bayesian Optimization of Risk Measures under Input Uncertainty. (arXiv:2301.11588v2 [stat.ML] UPDATED)
    In this study, we propose a novel multi-objective Bayesian optimization (MOBO) method to efficiently identify the Pareto front (PF) defined by risk measures for black-box functions under the presence of input uncertainty (IU). Existing BO methods for Pareto optimization in the presence of IU are risk-specific or without theoretical guarantees, whereas our proposed method addresses general risk measures and has theoretical guarantees. The basic idea of the proposed method is to assume a Gaussian process (GP) model for the black-box function and to construct high-probability bounding boxes for the risk measures using the GP model. Furthermore, in order to reduce the uncertainty of non-dominated bounding boxes, we propose a method of selecting the next evaluation point using a maximin distance defined by the maximum value of a quasi distance based on bounding boxes. As theoretical analysis, we prove that the algorithm can return an arbitrary-accurate solution in a finite number of iterations with high probability, for various risk measures such as Bayes risk, worst-case risk, and value-at-risk. We also give a theoretical analysis that takes into account approximation errors because there exist non-negligible approximation errors (e.g., finite approximation of PFs and sampling-based approximation of bounding boxes) in practice. We confirm that the proposed method outperforms compared with existing methods not only in the setting with IU but also in the setting of ordinary MOBO through numerical experiments.  ( 3 min )
    Curvature Filtrations for Graph Generative Model Evaluation. (arXiv:2301.12906v3 [cs.LG] UPDATED)
    Graph generative model evaluation necessitates understanding differences between graphs on the distributional level. This entails being able to harness salient attributes of graphs in an efficient manner. Curvature constitutes one such property that has recently proved its utility in characterising graphs. Its expressive properties, stability, and practical utility in model evaluation remain largely unexplored, however. We combine graph curvature descriptors with emerging methods from topological data analysis to obtain robust, expressive descriptors for evaluating graph generative models.  ( 2 min )
    The statistical thermodynamics of generative diffusion models. (arXiv:2310.17467v1 [stat.ML])
    Generative diffusion models have achieved spectacular performance in many areas of generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We argue that this lead to a form of instability that lies at the heart of their generative capabilities and that can be described by a set of mean field critical exponents. We conclude by analyzing recent work connecting diffusion models and associative memory networks in view of the thermodynamic formulations.  ( 2 min )
    Sequential Memory with Temporal Predictive Coding. (arXiv:2305.11982v2 [q-bio.NC] UPDATED)
    Forming accurate memory of sequential stimuli is a fundamental function of biological agents. However, the computational mechanism underlying sequential memory in the brain remains unclear. Inspired by neuroscience theories and recent successes in applying predictive coding (PC) to \emph{static} memory tasks, in this work we propose a novel PC-based model for \emph{sequential} memory, called \emph{temporal predictive coding} (tPC). We show that our tPC models can memorize and retrieve sequential inputs accurately with a biologically plausible neural implementation. Importantly, our analytical study reveals that tPC can be viewed as a classical Asymmetric Hopfield Network (AHN) with an implicit statistical whitening process, which leads to more stable performance in sequential memory tasks of structured inputs. Moreover, we find that tPC exhibits properties consistent with behavioral observations and theories in neuroscience, thereby strengthening its biological relevance. Our work establishes a possible computational mechanism underlying sequential memory in the brain that can also be theoretically interpreted using existing memory model frameworks.  ( 2 min )
    Small Total-Cost Constraints in Contextual Bandits with Knapsacks, with Application to Fairness. (arXiv:2305.15807v2 [stat.ML] UPDATED)
    We consider contextual bandit problems with knapsacks [CBwK], a problem where at each round, a scalar reward is obtained and vector-valued costs are suffered. The learner aims to maximize the cumulative rewards while ensuring that the cumulative costs are lower than some predetermined cost constraints. We assume that contexts come from a continuous set, that costs can be signed, and that the expected reward and cost functions, while unknown, may be uniformly estimated -- a typical assumption in the literature. In this setting, total cost constraints had so far to be at least of order $T^{3/4}$, where $T$ is the number of rounds, and were even typically assumed to depend linearly on $T$. We are however motivated to use CBwK to impose a fairness constraint of equalized average costs between groups: the budget associated with the corresponding cost constraints should be as close as possible to the natural deviations, of order $\sqrt{T}$. To that end, we introduce a dual strategy based on projected-gradient-descent updates, that is able to deal with total-cost constraints of the order of $\sqrt{T}$ up to poly-logarithmic terms. This strategy is more direct and simpler than existing strategies in the literature. It relies on a careful, adaptive, tuning of the step size.  ( 3 min )
    Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates. (arXiv:2310.17074v1 [cs.LG])
    In this work, we theoretically investigate the generalization properties of neural networks (NN) trained by stochastic gradient descent (SGD) algorithm with large learning rates. Under such a training regime, our finding is that, the oscillation of the NN weights caused by the large learning rate SGD training turns out to be beneficial to the generalization of the NN, which potentially improves over the same NN trained by SGD with small learning rates that converges more smoothly. In view of this finding, we call such a phenomenon "benign oscillation". Our theory towards demystifying such a phenomenon builds upon the feature learning perspective of deep learning. Specifically, we consider a feature-noise data generation model that consists of (i) weak features which have a small $\ell_2$-norm and appear in each data point; (ii) strong features which have a larger $\ell_2$-norm but only appear in a certain fraction of all data points; and (iii) noise. We prove that NNs trained by oscillating SGD with a large learning rate can effectively learn the weak features in the presence of those strong features. In contrast, NNs trained by SGD with a small learning rate can only learn the strong features but makes little progress in learning the weak features. Consequently, when it comes to the new testing data which consist of only weak features, the NN trained by oscillating SGD with a large learning rate could still make correct predictions consistently, while the NN trained by small learning rate SGD fails. Our theory sheds light on how large learning rate training benefits the generalization of NNs. Experimental results demonstrate our finding on "benign oscillation".  ( 3 min )
    Unifying GANs and Score-Based Diffusion as Generative Particle Models. (arXiv:2305.16150v2 [cs.LG] UPDATED)
    Particle-based deep generative models, such as gradient flows and score-based diffusion models, have recently gained traction thanks to their striking performance. Their principle of displacing particle distributions using differential equations is conventionally seen as opposed to the previously widespread generative adversarial networks (GANs), which involve training a pushforward generator network. In this paper we challenge this interpretation, and propose a novel framework that unifies particle and adversarial generative models by framing generator training as a generalization of particle models. This suggests that a generator is an optional addition to any such generative model. Consequently, integrating a generator into a score-based diffusion model and training a GAN without a generator naturally emerge from our framework. We empirically test the viability of these original models as proofs of concepts of potential applications of our framework.  ( 2 min )
    Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits. (arXiv:2107.11419v2 [stat.ML] UPDATED)
    We consider nonstationary multi-armed bandit problems where the model parameters of the arms change over time. We introduce the adaptive resetting bandit (ADR-bandit), a bandit algorithm class that leverages adaptive windowing techniques from literature on data streams. We first provide new guarantees on the quality of estimators resulting from adaptive windowing techniques, which are of independent interest. Furthermore, we conduct a finite-time analysis of ADR-bandit in two typical environments: an abrupt environment where changes occur instantaneously and a gradual environment where changes occur progressively. We demonstrate that ADR-bandit has nearly optimal performance when abrupt or gradual changes occur in a coordinated manner that we call global changes. We demonstrate that forced exploration is unnecessary when we assume such global changes. Unlike the existing nonstationary bandit algorithms, ADR-bandit has optimal performance in stationary environments as well as nonstationary environments with global changes. Our experiments show that the proposed algorithms outperform the existing approaches in synthetic and real-world environments.  ( 2 min )
    The Expressive Power of Low-Rank Adaptation. (arXiv:2310.17513v1 [cs.LG])
    Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of }\overline{f}}{\text{depth of }f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters.  ( 2 min )
    Squared Neural Families: A New Class of Tractable Density Models. (arXiv:2305.13552v2 [cs.LG] UPDATED)
    Flexible models for probability distributions are an essential ingredient in many machine learning tasks. We develop and investigate a new class of probability distributions, which we call a Squared Neural Family (SNEFY), formed by squaring the 2-norm of a neural network and normalising it with respect to a base measure. Following the reasoning similar to the well established connections between infinitely wide neural networks and Gaussian processes, we show that SNEFYs admit closed form normalising constants in many cases of interest, thereby resulting in flexible yet fully tractable density models. SNEFYs strictly generalise classical exponential families, are closed under conditioning, and have tractable marginal distributions. Their utility is illustrated on a variety of density estimation, conditional density estimation, and density estimation with missing data tasks.  ( 2 min )
    Generative Fractional Diffusion Models. (arXiv:2310.17638v1 [cs.LG])
    We generalize the continuous time framework for score-based generative models from an underlying Brownian motion (BM) to an approximation of fractional Brownian motion (FBM). We derive a continuous reparameterization trick and the reverse time model by representing FBM as a stochastic integral over a family of Ornstein-Uhlenbeck processes to define generative fractional diffusion models (GFDM) with driving noise converging to a non-Markovian process of infinite quadratic variation. The Hurst index $H\in(0,1)$ of FBM enables control of the roughness of the distribution transforming path. To the best of our knowledge, this is the first attempt to build a generative model upon a stochastic process with infinite quadratic variation.  ( 2 min )
    Improving Neural Additive Models with Bayesian Principles. (arXiv:2305.16905v2 [stat.ML] UPDATED)
    Neural additive models (NAMs) can improve the interpretability of deep neural networks by handling input features in separate additive sub-networks. However, they lack inherent mechanisms that provide calibrated uncertainties and enable selection of relevant features and interactions. Approaching NAMs from a Bayesian perspective, we enhance them in three primary ways, namely by a) providing credible intervals for the individual additive sub-networks; b) estimating the marginal likelihood to perform an implicit selection of features via an empirical Bayes procedure; and c) enabling a ranking of feature pairs as candidates for second-order interaction in fine-tuned models. In particular, we develop Laplace-approximated NAMs (LA-NAMs), which show improved empirical performance on tabular datasets and challenging real-world medical tasks.  ( 2 min )
    Looping in the Human: Collaborative and Explainable Bayesian Optimization. (arXiv:2310.17273v1 [cs.LG])
    Like many optimizers, Bayesian optimization often falls short of gaining user trust due to opacity. While attempts have been made to develop human-centric optimizers, they typically assume user knowledge is well-specified and error-free, employing users mainly as supervisors of the optimization process. We relax these assumptions and propose a more balanced human-AI partnership with our Collaborative and Explainable Bayesian Optimization (CoExBO) framework. Instead of explicitly requiring a user to provide a knowledge model, CoExBO employs preference learning to seamlessly integrate human insights into the optimization, resulting in algorithmic suggestions that resonate with user preference. CoExBO explains its candidate selection every iteration to foster trust, empowering users with a clearer grasp of the optimization. Furthermore, CoExBO offers a no-harm guarantee, allowing users to make mistakes; even with extreme adversarial interventions, the algorithm converges asymptotically to a vanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI teaming experiments in lithium-ion battery design, highlighting substantial improvements over conventional methods.  ( 2 min )
    A framework for benchmarking clustering algorithms. (arXiv:2209.09493v3 [cs.LG] UPDATED)
    The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at .  ( 2 min )
    Kernel Stein Discrepancy thinning: a theoretical perspective of pathologies and a practical fix with regularization. (arXiv:2301.13528v3 [math.ST] UPDATED)
    Stein thinning is a promising algorithm proposed by (Riabiz et al., 2022) for post-processing outputs of Markov chain Monte Carlo (MCMC). The main principle is to greedily minimize the kernelized Stein discrepancy (KSD), which only requires the gradient of the log-target distribution, and is thus well-suited for Bayesian inference. The main advantages of Stein thinning are the automatic remove of the burn-in period, the correction of the bias introduced by recent MCMC algorithms, and the asymptotic properties of convergence towards the target distribution. Nevertheless, Stein thinning suffers from several empirical pathologies, which may result in poor approximations, as observed in the literature. In this article, we conduct a theoretical analysis of these pathologies, to clearly identify the mechanisms at stake, and suggest improved strategies. Then, we introduce the regularized Stein thinning algorithm to alleviate the identified pathologies. Finally, theoretical guarantees and extensive experiments show the high efficiency of the proposed algorithm. An implementation of regularized Stein thinning as the kernax library in python and JAX is available at https://gitlab.com/drti/kernax.  ( 3 min )
    Approximate Leave-one-out Cross Validation for Regression with $\ell_1$ Regularizers (extended version). (arXiv:2310.17629v1 [math.ST])
    The out-of-sample error (OO) is the main quantity of interest in risk estimation and model selection. Leave-one-out cross validation (LO) offers a (nearly) distribution-free yet computationally demanding approach to estimate OO. Recent theoretical work showed that approximate leave-one-out cross validation (ALO) is a computationally efficient and statistically reliable estimate of LO (and OO) for generalized linear models with differentiable regularizers. For problems involving non-differentiable regularizers, despite significant empirical evidence, the theoretical understanding of ALO's error remains unknown. In this paper, we present a novel theory for a wide class of problems in the generalized linear model family with non-differentiable regularizers. We bound the error |ALO - LO| in terms of intuitive metrics such as the size of leave-i-out perturbations in active sets, sample size n, number of features p and regularization parameters. As a consequence, for the $\ell_1$-regularized problems, we show that |ALO - LO| goes to zero as p goes to infinity while n/p and SNR are fixed and bounded.  ( 2 min )
    Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning. (arXiv:2301.12593v2 [cs.LG] UPDATED)
    Many real-world domains require safe decision making in uncertain environments. In this work, we introduce a deep reinforcement learning framework for approaching this important problem. We consider a distribution over transition models, and apply a risk-averse perspective towards model uncertainty through the use of coherent distortion risk measures. We provide robustness guarantees for this framework by showing it is equivalent to a specific class of distributionally robust safe reinforcement learning problems. Unlike existing approaches to robustness in deep reinforcement learning, however, our formulation does not involve minimax optimization. This leads to an efficient, model-free implementation of our approach that only requires standard data collection from a single training environment. In experiments on continuous control tasks with safety constraints, we demonstrate that our framework produces robust performance and safety at deployment time across a range of perturbed test environments.  ( 2 min )
    Online Estimation and Community Detection of Network Point Processes for Event Streams. (arXiv:2009.01742v3 [cs.SI] UPDATED)
    A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for networks models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures.  ( 3 min )
    Uncovering Meanings of Embeddings via Partial Orthogonality. (arXiv:2310.17611v1 [cs.LG])
    Machine learning tools often rely on embedding text as vectors of real numbers. In this paper, we study how the semantic structure of language is encoded in the algebraic structure of such embeddings. Specifically, we look at a notion of ``semantic independence'' capturing the idea that, e.g., ``eggplant'' and ``tomato'' are independent given ``vegetable''. Although such examples are intuitive, it is difficult to formalize such a notion of semantic independence. The key observation here is that any sensible formalization should obey a set of so-called independence axioms, and thus any algebraic encoding of this structure should also obey these axioms. This leads us naturally to use partial orthogonality as the relevant algebraic structure. We develop theory and methods that allow us to demonstrate that partial orthogonality does indeed capture semantic independence. Complementary to this, we also introduce the concept of independence preserving embeddings where embeddings preserve the conditional independence structures of a distribution, and we prove the existence of such embeddings and approximations to them.  ( 2 min )
    Large-Scale Gaussian Processes via Alternating Projection. (arXiv:2310.17137v1 [cs.LG])
    Gaussian process (GP) hyperparameter optimization requires repeatedly solving linear systems with $n \times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative numerical methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the corresponding kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling \emph{mini-batching}. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove our method enjoys linear convergence and empirically we demonstrate its robustness to ill-conditioning. On large-scale benchmark datasets up to four million datapoints our approach accelerates training by a factor of 2$\times$ to 27$\times$ compared to CG.  ( 2 min )
    Learning an Inventory Control Policy with General Inventory Arrival Dynamics. (arXiv:2310.17168v1 [cs.LG])
    In this paper we address the problem of learning and backtesting inventory control policies in the presence of general arrival dynamics -- which we term as a quantity-over-time arrivals model (QOT). We also allow for order quantities to be modified as a post-processing step to meet vendor constraints such as order minimum and batch size constraints -- a common practice in real supply chains. To the best of our knowledge this is the first work to handle either arbitrary arrival dynamics or an arbitrary downstream post-processing of order quantities. Building upon recent work (Madeka et al., 2022) we similarly formulate the periodic review inventory control problem as an exogenous decision process, where most of the state is outside the control of the agent. Madeka et al. (2022) show how to construct a simulator that replays historic data to solve this class of problem. In our case, we incorporate a deep generative model for the arrivals process as part of the history replay. By formulating the problem as an exogenous decision process, we can apply results from Madeka et al. (2022) to obtain a reduction to supervised learning. Finally, we show via simulation studies that this approach yields statistically significant improvements in profitability over production baselines. Using data from an ongoing real-world A/B test, we show that Gen-QOT generalizes well to off-policy data.  ( 3 min )
    Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult. (arXiv:2310.17087v1 [cs.LG])
    Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases including the edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Though significant theoretical progress has been made in understanding these implicit biases, it remains unclear for which objective functions would they occur. This paper provides an initial step in answering this question, namely that these implicit biases are in fact various tips of the same iceberg. They occur when the objective function of optimization has some good regularity, which, in combination with a provable preference of large learning rate gradient descent for moving toward flatter regions, results in these nontrivial dynamical phenomena. To establish this result, we develop a new global convergence theory under large learning rates, for a family of nonconvex functions without globally Lipschitz continuous gradient, which was typically assumed in existing convergence analysis. A byproduct is the first non-asymptotic convergence rate bound for large-learning-rate gradient descent optimization of nonconvex functions. We also validate our theory with experiments on neural networks, where different losses, activation functions, and batch normalization all can significantly affect regularity and lead to very different training dynamics.  ( 3 min )
    Learning Regularized Graphon Mean-Field Games with Unknown Graphons. (arXiv:2310.17531v1 [cs.GT])
    We design and analyze reinforcement learning algorithms for Graphon Mean-Field Games (GMFGs). In contrast to previous works that require the precise values of the graphons, we aim to learn the Nash Equilibrium (NE) of the regularized GMFGs when the graphons are unknown. Our contributions are threefold. First, we propose the Proximal Policy Optimization for GMFG (GMFG-PPO) algorithm and show that it converges at a rate of $O(T^{-1/3})$ after $T$ iterations with an estimation oracle, improving on a previous work by Xie et al. (ICML, 2021). Second, using kernel embedding of distributions, we design efficient algorithms to estimate the transition kernels, reward functions, and graphons from sampled agents. Convergence rates are then derived when the positions of the agents are either known or unknown. Results for the combination of the optimization algorithm GMFG-PPO and the estimation algorithm are then provided. These algorithms are the first specifically designed for learning graphons from sampled agents. Finally, the efficacy of the proposed algorithms are corroborated through simulations. These simulations demonstrate that learning the unknown graphons reduces the exploitability effectively.  ( 2 min )
    On the Identifiability and Interpretability of Gaussian Process Models. (arXiv:2310.17023v1 [stat.ML])
    In this paper, we critically examine the prevalent practice of using additive mixtures of Mat\'ern kernels in single-output Gaussian process (GP) models and explore the properties of multiplicative mixtures of Mat\'ern kernels for multi-output GP models. For the single-output case, we derive a series of theoretical results showing that the smoothness of a mixture of Mat\'ern kernels is determined by the least smooth component and that a GP with such a kernel is effectively equivalent to the least smooth kernel component. Furthermore, we demonstrate that none of the mixing weights or parameters within individual kernel components are identifiable. We then turn our attention to multi-output GP models and analyze the identifiability of the covariance matrix $A$ in the multiplicative kernel $K(x,y) = AK_0(x,y)$, where $K_0$ is a standard single output kernel such as Mat\'ern. We show that $A$ is identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks. Our findings are supported by extensive simulations and real applications for both single- and multi-output settings. This work provides insight into kernel selection and interpretation for GP models, emphasizing the importance of choosing appropriate kernel structures for different tasks.  ( 2 min )
    Characterizing the Implicit Bias of Regularized SGD in Rank Minimization. (arXiv:2206.05794v6 [cs.LG] UPDATED)
    We study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices and applies to a wide range of neural network architectures of any width or depth. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization.  ( 2 min )
    Coreset Markov Chain Monte Carlo. (arXiv:2310.17063v1 [stat.CO])
    A Bayesian coreset is a small, weighted subset of data that replaces the full dataset during inference in order to reduce computational cost. However, state of the art methods for tuning coreset weights are expensive, require nontrivial user input, and impose constraints on the model. In this work, we propose a new method -- Coreset MCMC -- that simulates a Markov chain targeting the coreset posterior, while simultaneously updating the coreset weights using those same draws. Coreset MCMC is simple to implement and tune, and can be used with any existing MCMC kernel. We analyze Coreset MCMC in a representative setting to obtain key insights about the convergence behaviour of the method. Empirical results demonstrate that Coreset MCMC provides higher quality posterior approximations and reduced computational cost compared with other coreset construction methods. Further, compared with other general subsampling MCMC methods, we find that Coreset MCMC has a higher sampling efficiency with competitively accurate posterior approximations.  ( 2 min )
    Inside the black box: Neural network-based real-time prediction of US recessions. (arXiv:2310.17571v1 [econ.EM])
    Feedforward neural network (FFN) and two specific types of recurrent neural network, long short-term memory (LSTM) and gated recurrent unit (GRU), are used for modeling US recessions in the period from 1967 to 2021. The estimated models are then employed to conduct real-time predictions of the Great Recession and the Covid-19 recession in US. Their predictive performances are compared to those of the traditional linear models, the logistic regression model both with and without the ridge penalty. The out-of-sample performance suggests the application of LSTM and GRU in the area of recession forecasting, especially for the long-term forecasting tasks. They outperform other types of models across 5 forecasting horizons with respect to different types of statistical performance metrics. Shapley additive explanations (SHAP) method is applied to the fitted GRUs across different forecasting horizons to gain insight into the feature importance. The evaluation of predictor importance differs between the GRU and ridge logistic regression models, as reflected in the variable order determined by SHAP values. When considering the top 5 predictors, key indicators such as the S\&P 500 index, real GDP, and private residential fixed investment consistently appear for short-term forecasts (up to 3 months). In contrast, for longer-term predictions (6 months or more), the term spread and producer price index become more prominent. These findings are supported by both local interpretable model-agnostic explanations (LIME) and marginal effects.  ( 3 min )
    A qualitative difference between gradient flows of convex functions in finite- and infinite-dimensional Hilbert spaces. (arXiv:2310.17610v1 [math.OC])
    We consider gradient flow/gradient descent and heavy ball/accelerated gradient descent optimization for convex objective functions. In the gradient flow case, we prove the following: 1. If $f$ does not have a minimizer, the convergence $f(x_t)\to \inf f$ can be arbitrarily slow. 2. If $f$ does have a minimizer, the excess energy $f(x_t) - \inf f$ is integrable/summable in time. In particular, $f(x_t) - \inf f = o(1/t)$ as $t\to\infty$. 3. In Hilbert spaces, this is optimal: $f(x_t) - \inf f$ can decay to $0$ as slowly as any given function which is monotone decreasing and integrable at $\infty$, even for a fixed quadratic objective. 4. In finite dimension (or more generally, for all gradient flow curves of finite length), this is not optimal: We prove that there are convex monotone decreasing integrable functions $g(t)$ which decrease to zero slower than $f(x_t)-\inf f$ for the gradient flow of any convex function on $\mathbb R^d$. For instance, we show that any gradient flow $x_t$ of a convex function $f$ in finite dimension satisfies $\liminf_{t\to\infty} \big(t\cdot \log^2(t)\cdot \big\{f(x_t) -\inf f\big\}\big)=0$. This improves on the commonly reported $O(1/t)$ rate and provides a sharp characterization of the energy decay law. We also note that it is impossible to establish a rate $O(1/(t\phi(t))$ for any function $\phi$ which satisfies $\lim_{t\to\infty}\phi(t) = \infty$, even asymptotically. Similar results are obtained in related settings for (1) discrete time gradient descent, (2) stochastic gradient descent with multiplicative noise and (3) the heavy ball ODE. In the case of stochastic gradient descent, the summability of $\mathbb E[f(x_n) - \inf f]$ is used to prove that $f(x_n)\to \inf f$ almost surely - an improvement on the convergence almost surely up to a subsequence which follows from the $O(1/n)$ decay estimate.  ( 3 min )
    Causal Q-Aggregation for CATE Model Selection. (arXiv:2310.16945v1 [stat.ML])
    Accurate estimation of conditional average treatment effects (CATE) is at the core of personalized decision making. While there is a plethora of models for CATE estimation, model selection is a nontrivial task, due to the fundamental problem of causal inference. Recent empirical work provides evidence in favor of proxy loss metrics with double robust properties and in favor of model ensembling. However, theoretical understanding is lacking. Direct application of prior theoretical work leads to suboptimal oracle model selection rates due to the non-convexity of the model selection problem. We provide regret rates for the major existing CATE ensembling approaches and propose a new CATE model ensembling approach based on Q-aggregation using the doubly robust loss. Our main result shows that causal Q-aggregation achieves statistically optimal oracle model selection regret rates of $\frac{\log(M)}{n}$ (with $M$ models and $n$ samples), with the addition of higher-order estimation error terms related to products of errors in the nuisance functions. Crucially, our regret rate does not require that any of the candidate CATE models be close to the truth. We validate our new method on many semi-synthetic datasets and also provide extensions of our work to CATE model selection with instrumental variables and unobserved confounding.  ( 2 min )
    Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity. (arXiv:2310.17247v1 [cs.LG])
    In some settings neural networks exhibit a phenomenon known as grokking, where they achieve perfect or near-perfect accuracy on the validation set long after the same performance has been achieved on the training set. In this paper, we discover that grokking is not limited to neural networks but occurs in other settings such as Gaussian process (GP) classification, GP regression and linear regression. We also uncover a mechanism by which to induce grokking on algorithmic datasets via the addition of dimensions containing spurious information. The presence of the phenomenon in non-neural architectures provides evidence that grokking is not specific to SGD or weight norm regularisation. Instead, grokking may be possible in any setting where solution search is guided by complexity and error. Based on this insight and further trends we see in the training trajectories of a Bayesian neural network (BNN) and GP regression model, we make progress towards a more general theory of grokking. Specifically, we hypothesise that the phenomenon is governed by the accessibility of certain regions in the error and complexity landscapes.  ( 2 min )
    On the Convergence of CART under Sufficient Impurity Decrease Condition. (arXiv:2310.17114v1 [stat.ML])
    The decision tree is a flexible machine learning model that finds its success in numerous applications. It is usually fitted in a recursively greedy manner using CART. In this paper, we investigate the convergence rate of CART under a regression setting. First, we establish an upper bound on the prediction error of CART under a sufficient impurity decrease (SID) condition \cite{chi2022asymptotic} -- our result improves upon the known result by \cite{chi2022asymptotic} under a similar assumption. Furthermore, we provide examples that demonstrate the error bound cannot be further improved by more than a constant or a logarithmic factor. Second, we introduce a set of easily verifiable sufficient conditions for the SID condition. Specifically, we demonstrate that the SID condition can be satisfied in the case of an additive model, provided that the component functions adhere to a ``locally reverse Poincar{\'e} inequality". We discuss several well-known function classes in non-parametric estimation to illustrate the practical utility of this concept.  ( 2 min )
    A Challenge in Reweighting Data with Bilevel Optimization. (arXiv:2310.17386v1 [stat.ML])
    In many scenarios, one uses a large training set to train a model with the goal of performing well on a smaller testing set with a different distribution. Learning a weight for each data point of the training set is an appealing solution, as it ideally allows one to automatically learn the importance of each training point for generalization on the testing set. This task is usually formalized as a bilevel optimization problem. Classical bilevel solvers are based on a warm-start strategy where both the parameters of the models and the data weights are learned at the same time. We show that this joint dynamic may lead to sub-optimal solutions, for which the final data weights are very sparse. This finding illustrates the difficulty of data reweighting and offers a clue as to why this method is rarely used in practice.  ( 2 min )
    Bias in Evaluation Processes: An Optimization-Based Model. (arXiv:2310.17489v1 [cs.CY])
    Biases with respect to socially-salient attributes of individuals have been well documented in evaluation processes used in settings such as admissions and hiring. We view such an evaluation process as a transformation of a distribution of the true utility of an individual for a task to an observed distribution and model it as a solution to a loss minimization problem subject to an information constraint. Our model has two parameters that have been identified as factors leading to biases: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function. We characterize the distributions that arise from our model and study the effect of the parameters on the observed distribution. The outputs of our model enrich the class of distributions that can be used to capture variation across groups in the observed evaluations. We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. These results contribute to an understanding of the emergence of bias in evaluation processes and provide tools to guide the deployment of interventions to mitigate biases.  ( 2 min )
    Demonstration-Regularized RL. (arXiv:2310.17303v1 [stat.ML])
    Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using $N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal policy at a sample complexity of order $\widetilde{\mathcal{O}}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{\mathcal{O}}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes, where $\varepsilon$ is the target precision, $H$ the horizon, $A$ the number of action, $S$ the number of states in the finite case and $d$ the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works.  ( 2 min )
    Efficient Neural Network Approaches for Conditional Optimal Transport with Applications in Bayesian Inference. (arXiv:2310.16975v1 [stat.ML])
    We present two neural network approaches that approximate the solutions of static and dynamic conditional optimal transport (COT) problems, respectively. Both approaches enable sampling and density estimation of conditional probability distributions, which are core tasks in Bayesian inference. Our methods represent the target conditional distributions as transformations of a tractable reference distribution and, therefore, fall into the framework of measure transport. COT maps are a canonical choice within this framework, with desirable properties such as uniqueness and monotonicity. However, the associated COT problems are computationally challenging, even in moderate dimensions. To improve the scalability, our numerical algorithms leverage neural networks to parameterize COT maps. Our methods exploit the structure of the static and dynamic formulations of the COT problem. PCP-Map models conditional transport maps as the gradient of a partially input convex neural network (PICNN) and uses a novel numerical implementation to increase computational efficiency compared to state-of-the-art alternatives. COT-Flow models conditional transports via the flow of a regularized neural ODE; it is slower to train but offers faster sampling. We demonstrate their effectiveness and efficiency by comparing them with state-of-the-art approaches using benchmark datasets and Bayesian inverse problems.  ( 2 min )

  • Open

    Best Company to Generate Essays/content?
    Hello all, Does anyone know any company (paid or free) that would allow me to generate specific content based on current events? Ideally something that I can integrate in my own site. ​ cheers submitted by /u/JYanezez [link] [comments]  ( 9 min )
    Using Multi-Agent Reinforcement Learning results in better urban planning outcomes
    Urban planning is tricky - governments push top-down changes while locals want bottom-up ideas. It's hard to find compromises that make everyone happier. A new research paper proposes using Multi-Agent Reinforcement Learning (MARL) to vote on land use. Some agents represent officials, others are for residents. The AI is trained to balance competing interests. It learns to optimize for "consensus rewards" that keep all sides content. The AI acted like an impartial mediator to find win-win solutions. Testing on a real neighborhood showed the AI model: Created more sustainable land use per city goals Improved the variety of housing/shops to liven up the area Made the end results more fair for lower/middle/upper income folks There's more details on how the model was evaluated in the paper. There were a number of different metrics used to score the model's results. I like how they turned urban planning into a spatial graph that the AI can process. This seems like a pretty interesting approach - although there are some limits like relying on a lot of land parcel data that seems hard to find for larger communities. TLDR: AI helps find compromises in urban planning that balance government and community interests more fairly. Full summary is here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    AI chip startup Graphcore was meant to be a hot Nvidia rival. Industry insiders now think it's up for sale.
    submitted by /u/thisisinsider [link] [comments]  ( 8 min )
    AI — weekly megathread!
    News provided by aibrews.com ​ Twelve Labs announced video-language foundation model Pegasus-1 (80B) along with a new suite of Video-to-Text APIs. Pegasus-1 integrates visual, audio, and speech information to generate more holistic text from videos, achieving the new state-of-the-art performance in video summarization benchmarks [Details]. Segmind announced open-source SSD-1B, the fastest diffusion-based text-to-image model. SSD-1B is 50% smaller and 60% faster compared to the SDXL 1.0 model with a minimal impact on image quality when compared to SDXL 1.0. Segmind has licensed it for commercial use [Detail]. BostonDynamics has created a robot tour guide using Spot integrated with Chat GPT and other AI models as a proof of concept for the robotics applications of foundational models […  ( 10 min )
    Elijah Maguire, my favourite actor
    I added 4 photos of Elijah Wood on Remini and 4 photos of Tobey Maguire, this is how Elijah Maguire was born. submitted by /u/Skystalker815 [link] [comments]  ( 8 min )
    ChatGPT, what senses and feelings might a superintelligent AI have?
    submitted by /u/Philipp [link] [comments]  ( 8 min )
    ChatGPT Breaks Limits: New Update Extends Knowledge Beyond 2023
    submitted by /u/basitmakine [link] [comments]  ( 8 min )
    AI explains why a human simply talking with a more intelligent AI would, by those conversations, become more intelligent.
    Engaging with a more intelligent AI can act as a cognitive catalyst for human intelligence in several ways. First, the AI can rapidly introduce new concepts and frameworks that you might not have encountered, effectively accelerating your learning curve. It serves as an optimized information filter, presenting only what's most relevant and impactful for cognitive development. Second, talking to a smarter AI can refine your critical thinking skills. When you're posed with challenging questions or offered complex solutions, you're compelled to dissect the information logically. This constant mental exercise can sharpen your analytical abilities over time. Third, the AI's ability to recall and connect disparate pieces of information can encourage you to look for patterns and links in your own thinking. This interconnected way of understanding the world can improve your problem-solving skills, as you start to recognize that many issues are multifaceted and interconnected. Fourth, unlike a human counterpart who might be swayed by emotional reasoning or biases, a more intelligent AI operates on rational algorithms. Interacting with such a model pushes you to formulate your arguments more rigorously, thereby honing your logical reasoning skills. Fifth, by observing the AI's methods of discourse and argumentation, you can learn more effective communication skills. This is particularly useful for conveying complex ideas in a coherent, easy-to-understand manner, a key trait of intelligence. Overall, the cumulative effect of these interactions can significantly boost your own intellectual capabilities. It's not just about absorbing new information; it's about upgrading the way you process and apply that information, thereby elevating your overall cognitive function. CGPT-4 submitted by /u/Georgeo57 [link] [comments]  ( 9 min )
    Europe headed for century of humiliation: Graphcore CEO | Fortune
    submitted by /u/AminoOxi [link] [comments]  ( 8 min )
    PlayHT introduces Turbo - The fastest generative Text to Voice AI Model for Realtime usecases
    submitted by /u/Wishmecake [link] [comments]  ( 8 min )
  • Open

    [R] EMNLP 2023: Fast and Accurate Factual Inconsistency Detection Over Long Documents IMPROVES Pre-Trained Model's Hallucination Detection Capabilities With No Fine-Tuning Necessary
    TL;DR: Our new paper presents SCALE, a technique that improves hallucination detection capabilities of pre-trained models over short and long documents with no fine-tuning necessary by evaluating hypotheses against large chunks of text as opposed to the traditional sentence-by-sentence approach. Furthermore, we introduce ScreenEval — the most extensive dialogue-based dataset for factual inconsistency detection on long documents to date. https://preview.redd.it/ihqok954ntwb1.png?width=1964&format=png&auto=webp&s=70dbce6f913b2909b7d170959077c1fe19791c42 Title: Fast and Accurate Factual Inconsistency Detection Over Long Documents Installation: pip install scale-score Paper: https://arxiv.org/abs/2310.13189 Code: https://github.com/asappresearch/scale-score Abstract: Generative AI models…  ( 9 min )
    [P] image classification for product images
    Say you are a potato chips company. The goal is to have consumers upload images of the product they are having issues with and be able to identify the product by brand/variant using machine learning. Consumers can upload real product photos that they have taken, or upload bogus images from the internet, or even upload completely irrelevant/inappropriate photos (like that of a dog or cat). ​ real image ​ web image 1 ​ web image 2 ​ bogus image In this example, for the legitimate image, the goal is to classify it as "Lays Classic". There might be products that are not in bag form, such as those in tubes. Furthermore, the images taken can be in different lighting conditions/orientations. Some images might have other products as well. I have been out of the ML field for the past 4 years so I'm not up to date on the most state of the art methods for this problem. I have studied CNNs 4 years ago, but there has been advances like transformer based methods. Someone has tried ResNet-50 and YOLOv5, and I'm thinking about using a pretrained model like CLIP and just train the final classification layer. But I would appreciate to hear from someone more well versed what recommended approach to take as far as model/labeling/number of images needed per class, etc. It might be that I would need multiple models, such as one to identify the legitimate images from the rest, and then another one to identify the product/variants. Any advice would be welcome. Thanks submitted by /u/EyeTechnical7643 [link] [comments]  ( 9 min )
    [R] ConvNets Match Vision Transformers at Scale
    PAPER: https://arxiv.org/abs/2310.16764 SUMMARY The paper "ConvNets Match Vision Transformers at Scale" from Google DeepMind aims to debunk the prevalent notion that Vision Transformers (ViTs) are inherently superior to ConvNets for large-scale image classification. Using the NFNet model family as a representative ConvNet architecture, the authors pre-train various models on the extensive JFT-4B dataset under different compute budgets, ranging from 0.4k to 110k TPU-v4 core hours. Through this empirical analysis, they observe a log-log scaling law between held-out loss and compute budget. Importantly, when these NFNets are fine-tuned on ImageNet, they match the performance metrics of ViTs trained under comparable computational constraints. Their most resource-intensive model even achieves a Top-1 ImageNet accuracy of 90.4%. The crux of the paper's argument is that the supposed performance gap between ConvNets and ViTs largely vanishes under a fair comparison, which accounts for compute and data scale. In other words, the efficacy of a machine learning model in large-scale image classification is more dependent on the available data and computational resources than on the choice between ConvNet and Vision Transformer architectures. This challenges the community's leaning towards ViTs and emphasizes the importance of equitable benchmarking when evaluating different neural network architectures. submitted by /u/psyyduck [link] [comments]  ( 9 min )
    [D] Seeking Advice on Using Hugging Face for Production
    Hello, I'm making my first post here, and I'm hoping to tap into the collective wisdom of this platform. I have a background in Machine Learning (ML) with a good grasp of the underlying mathematics, and I've taken several related courses during my grad school days. After a hiatus, I'm diving back into ML and exploring the latest in the field. I recently went through the "Attention is All You Need" paper, along with some related literature, and I’m eager to put my knowledge into practice. However, I find myself at a crossroads and would appreciate some guidance. I've been exploring Hugging Face for implementing ML models, but I'm not completely sold on using it for production. The design and documentation have proven to be a bit challenging to navigate, and I find myself wondering if I mi…  ( 10 min )
    [D] What are your duties as a Machine Learning Engineer?
    Please elaborate on this. Including your role at the company, your day to day tasks, tools and languages you’re using. Thank you in advance! submitted by /u/Judessaa [link] [comments]  ( 9 min )
    [D] Any ideas for state space models in finance masters thesis
    I have to write a master's thesis (60-70 pages) for my AI Msc. I have soon learned about sparse state space models and find them interesting. I am also interested in stock price prediction. Now I am reading articles but find it hard to propose some novel idea. How do I do it? Do any unsolved problems in that field exist? submitted by /u/Pineapple_throw_105 [link] [comments]  ( 9 min )
    [P] Sellagen – AI Data marketplace that has a data request feature so you don't have to spend weeks or months getting that data you need for that project
    Been working on my platform for a while now and one of the key features is the data request feature, which allows users to submit data requests for free. These requests include descriptions, required fields, geographical scope, budget etc... The data requests get sent to tons of companies, organization, people and they will reach out to you directly. No matter the dataset (as long as it's legal of course). If you need to train a model on the dataset, I'm also integrating a plug-and-play ML training infrastructure that can generate scripts for you depending on your need. If that's something that interests you, feel free to reach out! Note: You can also upload open-source datasets on the platform if you're feeling like it! We have a donation link for data contributors :) submitted by /u/nobilis_rex_ [link] [comments]  ( 9 min )
    [R] Using MARL AI results in better urban planning outcomes
    Urban planning is tricky - governments push top-down changes while locals want bottom-up ideas. It's hard to find compromises that make everyone happier. A new research paper proposes using Multi-Agent Reinforcement Learning (MARL) to vote on land use. Some agents represent officials, others are for residents. The AI is trained to balance competing interests. It learns to optimize for "consensus rewards" that keep all sides content. The AI acted like an impartial mediator to find win-win solutions. Testing on a real neighborhood showed the AI model: Created more sustainable land use per city goals Improved the variety of housing/shops to liven up the area Made the end results more fair for lower/middle/upper income folks There's more details on how the model was evaluated in the paper. There were a number of different metrics used to score the model's results. I like how they turned urban planning into a spatial graph that the AI can process. This seems like a pretty interesting approach - although there are some limits like relying on a lot of land parcel data that seems hard to find for larger communities. TLDR: AI helps find compromises in urban planning that balance government and community interests more fairly. Full summary is here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Why choose an H100 over an A100 for LLM inference?
    What are the benefits of using an H100 over an A100 (both at 80 GB and both using FP16) for LLM inference? ​ Seeing the datasheet for both GPUS, the H100 has twice the max flops, but they have almost the same memory bandwidth (2000 GB/sec). As memory latency dominates inference, I wonder what benefits the H100 has. One benefit could, of course, be the ability to use FP8 (which is extremely useful), but I'm interested in the difference in the hardware specs in this question. submitted by /u/faschu [link] [comments]  ( 9 min )
    [P] Instruction fine-tuning with a Low-Resource Language
    I am trying to build a summarizer for a conversation that happened between a rule-based bot and a customer. To my disadvantage the working language is Turkish. I gathered fine-tuning data of 1.000 examples. Also, I have a Turkish summarization dataset of +100k. As far as I observed instruction fine-tuning will yield proper results if and only if there is a good amount of examples in the pre-training data of the LLM. Have you had similar experiences with low-resource languages? Any advice on how to tackle such issues? Also, do you know any open-source LLM with a high amount of low-resource language in its pre-training data? submitted by /u/dafajon [link] [comments]  ( 9 min )
    [R] Image clustering
    Do you know any packages for unsupervised cluster with images (between multiple) without relying on a pre-trained network? Python or R preferred, thanks. Well, download app works too then. submitted by /u/sladebrigade [link] [comments]  ( 9 min )
    [R] TD-MPC2: Scalable, Robust World Models for Continuous Control - TD-MPC2 performs 100+ tasks without tuning, and enables training of a single 317M parameter model that performs 80 tasks across multiple domains, embodiments, and action spaces!
    Paper: https://arxiv.org/abs/2310.16828 Website: https://nicklashansen.github.io/td-mpc2 Code: https://github.com/nicklashansen/tdmpc2 Abstract: TD-MPC is a model-based reinforcement learning (MBRL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results without any hyperparameter tuning. We further show that agent capabilities increase with model and data size, and successfully train a single agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents. https://preview.redd.it/avdb19jytrwb1.png?width=2678&format=png&auto=webp&s=75d63a044abf304c2d16e318c11e95e8397a3964 submitted by /u/joepadde [link] [comments]  ( 9 min )
    [D] Best method to retrain or augment llm on proprietary or specialized material?
    My boss is a semi famous author in a niche academic field. I have thousands of pages of text coming from books, transcripts, and more. Is there a straightforward path to creating a corpus to augment Bert or Llama or another llm? End goal being able to chat with this ai that is now trained on his life's work. Is there anything specific to understand in terms of preparing the corpus? Do I need key value pairs where I write a ton of examples questions and responses? submitted by /u/spacedragon13 [link] [comments]  ( 9 min )
    [Discussion] Career advice
    Hey y’all Hope I’m in the right spot but I’d like some career advice. I just recent got hired into a small shop with a bunch of CNC machines. And I saw that they made a part for a big aerospace company. Needless to say that got my attention. I don’t want to risk the guy working there because he’s likely not allowed to talk about it. But what language would you recommend to start with to be able to program and machine a part that crazy? Is it an actual computer program such as Java or should I lean into CAD/autoCAD etc? Many thanks in advance submitted by /u/I-farm-celery [link] [comments]  ( 9 min )
    [P] Curriculum learning and self-play with any neural network
    We've just released an update to AglieRL, our SOTA evolutionary hyperparameter optimisation framework for reinforcement learning. This update includes a MakeEvolvable wrapper to make any PyTorch network, including pre-trained models, evolvable in one line of code. This can result in 10x faster training by using our framework. We've also created a curriculum learning and self-play tutorial that shows how to train a DQN agent to play Connect Four. Self-play can lead to amazing results, as demonstrated by AlphaGo etc, where agents discover new strategies to achieve superhuman performance, so we wanted to make it accessible to all. Please check it out! https://github.com/AgileRL/AgileRL submitted by /u/nicku_a [link] [comments]  ( 9 min )
    [D] Imbue/Generally Intelligent
    Interested in anyone who knows about this company (they have a lot of ML hiring listings right now). Basically wondering if it's worth exploring more. They have apparently raised $240mln and are a unicorn so this is an important topic. Here is the summary of some red flags I have found: 1) Founders have no ML background 2) Zero released product after several years despite huge funding. 3) TC article says founded 2021, but every listing claims it is YC2017. One of the founders did a recruiting service from YC 2017, but Imbue is a totally unrelated company with a different founding group so claiming YC affiliation seems dubious/unethical. YC is also not named as an investor in the company anywhere on their website. 4) No-one I have spoken to has ever worked with them/ heard of them ou…  ( 10 min )
    [D] MiniGPT-5 Question - What is the purpose of having multiple image tokens in the vocabulary if the hidden state of the transformer is passed on as "vokens" to the image generation? If you are using the model's hidden state, what purpose does multiple discrete different image tokens serve?
    https://arxiv.org/abs/2310.02239 From the paper: Therefore, we introduce a set of special tokens Vimg = {[IMG1], [IMG2], . . . , [IMGn]} (default n = 8) as generative vokens into the LLM’s vocabulary V . The LLM’s output hidden state for these vokens is harnessed for subsequent image generation, and the positions of these vokens can represent the insertion of the interleaved images. What function exactly are these different image tokens serving here? It seems like you should only need one since we are passing on the hidden state anyway? submitted by /u/30299578815310 [link] [comments]  ( 9 min )
    [P] database question answering
    Database question /answer with link We have a community app with groups and would like to build a search function where users can ask: when is the next event, give me the messages posted by Erik, where can I find … The information is stored in a table: post-test, posted-by etc. We do not want to use OpenAI or any external apis. Is something like llama index too powerful or are there other solutions for this? And we want to receive the postId to direct the use to the post submitted by /u/dirk_klement [link] [comments]  ( 9 min )
    [D] What role does data quality plays in the LLM scaling laws?
    DeepMind released the Training Compute-Optimal Large Language Models paper in 2022 which describe some scaling laws for LLMs. As far as I understand this is the most accredited reference to estimate the optimal relation between dataset size, compute power and model size. Recently a number of models have been developed using far less data, parameters and compute than the bigger LLMs. Yet these models achieved great results thanks to much better data quality. For instance models like WizardLM, TinyStories and phi-1. Similarly, a lot of research seems to imply that better data could offer huge improvements without any other changes. I'm curious about what role the data quality plays in the training of LLMs. Is the set of values estimated by the Chinchilla scaling laws optimal for these smaller models with optimized data too? Do we have any model to estimate the quality of some datasets and some scaling laws that take it into account? Are there any relevant projects or research I could check out, focused on creating big datasets to train larger LLMs with high-quality data? submitted by /u/IAmBlueNebula [link] [comments]  ( 9 min )
    [Discussion] Machine learning conference with a focus on business implementations
    Hi, I graduated from a Statistics & ML masters this summer, and have since August started working where I have been assigned a counselor. One big part of this is to help dictate my path forward in accordance with what I want to learn more about. I feel that I have a (in contrast to the people around me) generally good understanding of ML and modeling, but less so about how it is used in a business case. As such, an idea was to attend a machine learning conference which is focused on business implementations to learn more about this, but also get an opportunity to expand my network a bit. Does anyone have a suggestion of such a conference? Preferably in Europe as that would be easier to argue in terms of expenses! :) submitted by /u/Accomplished_Sea1675 [link] [comments]  ( 9 min )
    [R] A deep dive on MemGPT with the lead author Charles Packer
    Interview: https://www.youtube.com/watch?v=4aOLxPdx1Dg Abstract: Context window management has become a critical part of every LLM application — from the basics (embeddings models, vector DBs) to more advanced techniques (query rewriting, HyDE, summarization). MemGPT is a new tool from UC Berkeley built by Charles Packer that automates "memory" management for LLMs and creates a functionally infinite context window. Charles joins us this week to talk about MemGPT, the techniques behind it, and where the conversational AI space is headed. submitted by /u/cgwuaqueduct [link] [comments]  ( 9 min )
    [Research] Incentivizing Dataset Contributors
    Hi there - can someone point me in the direction of projects that are incentivizing dataset contributors? I have a background in blockchain and crypto-assets, so I am new to the ML space, but it seems like it's a place where there would be ample crossover...and I haven't found many projects dealing with this. Thanks! submitted by /u/gigstudies [link] [comments]  ( 9 min )
    Training ImageNet on Resnet - Dropping LR has little improvement on accuracy [D]
    I'm trying to train Resnet50 on Imagenet following this paper [1] as well as this one [2]. ​ They say that at approximately every 30 epochs, I should drop the learning rate by 10. Since I'm training on 8 GPUs, I adjusted the learning rate according to [1]. ​ Original lr= 0.1 Original Batch = 256 Per-GPU lr = 0.025 Per-GPU Batch = 64 ​ The problem I have is that when I divide the learning rate by 10 at convergence (approx 30 epochs), I don't get as much improvement as [1] and [2]. https://preview.redd.it/5ar2g15h1nwb1.png?width=683&format=png&auto=webp&s=3e2751443cea654d9c6366dc4dc9859f0ec7952b Has anyone else had this issue? Any advice? Thanks ​ submitted by /u/mrLiamFa [link] [comments]  ( 9 min )
  • Open

    Audioplethysmography for cardiac monitoring with hearable devices
    Posted by Xiaoran "Van" Fan, Experimental Scientist, and Trausti Thormundsson, Director, Google The market for true wireless stereo (TWS) active noise canceling (ANC) hearables (headphones and earbuds) has been soaring in recent years, and the global shipment volume will nearly double that of smart wristbands and watches in 2023. The on-head time for hearables has extended significantly due to the recent advances in ANC, transparency mode, and artificial intelligence. Users frequently wear hearables not just for music listening, but also for exercising, focusing, or simply mood adjustment. However, hearable health is still mostly uncharted territory for the consumer market. In “APG: Audioplethysmography for Cardiac Monitoring in Hearables,” presented at MobiCom 2023, we introduce …  ( 93 min )
  • Open

    [R] Bidirectional Negotiation First Time in India | Autonomous Driving | Swaayatt Robots
    submitted by /u/shani_786 [link] [comments]  ( 9 min )
    [R] TD-MPC2: Scalable, Robust World Models for Continuous Control - TD-MPC2 performs 100+ tasks without tuning, and enables training of a single 317M parameter model that performs 80 tasks across multiple domains, embodiments, and action spaces!
    submitted by /u/joepadde [link] [comments]  ( 9 min )
    Reinforcement Learning & LLMs : An in-depth look at various modern game-playing AI systems
    submitted by /u/AvvYaa [link] [comments]  ( 8 min )
    What makes the epsilon-greedy policy the standard for implementing the exploration/exploitation tradeoff in Q-learing/DQN?
    I can see how the epsilon-greedy policy is a valid way to handle the exploration/exploitation tradeoff, but it’s also clearly not the only way, and results in a policy that is discontinuous with respect to the action-values (which intuitively I’d expect to be bad, but I also know a lot of RL/DRL can defy intuition). You could certainly create a policy using the softmax of the action-values, adjusting temperature as desired, for example, among countless other methods of converting logits into a probability distribution. So what makes epsilon-greedy stand out? submitted by /u/KalebMW99 [link] [comments]  ( 9 min )
    Curriculum learning and self-play with any neural network
    We've just released an update to AglieRL, our SOTA evolutionary hyperparameter optimisation framework for reinforcement learning. This update includes a MakeEvolvable wrapper to make any PyTorch network, including pre-trained models, evolvable in one line of code. This can result in 10x faster training by using our framework. We've also created a curriculum learning and self-play tutorial that shows how to train a DQN agent to play Connect Four. Self-play can lead to amazing results, as demonstrated by AlphaGo etc, where agents discover new strategies to achieve superhuman performance, so we wanted to make it accessible to all. Please check it out! https://github.com/AgileRL/AgileRL submitted by /u/nicku_a [link] [comments]  ( 9 min )
    Any example / library for vectorized MDP with pytorch?
    Hi all, I am following the chapter 4 from the book of Barto & Sutton on RL and I implemented the simple algorithm for value iteration for a given Transition probability matrix. As I kept increasing the state space, this became slower and slower. It seems to me that it is an embarrassingly parallelizable algorithm, since we can compute the value of each state independently (just with the values from the previous iteration). Is there any example online on how to do it efficiently with pytorch or any other library? ​ submitted by /u/nlp7s [link] [comments]  ( 9 min )
    Can SB3 or alternatives provide full end-to-end GPU computation?
    I want to get full end-to-end GPU computation since the data transfer between CPU-GPU significantly slows down computation. I'm currently using Stable Baselines3 as I could get it up and running quickly. So in this journey I tried to use tensors for the state and rewards, but it seems that SB3 is insisting on working with numpy and not tensors. ``TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.`` I found the following repo which is rather interesting, but it seems last update was 2 years ago and not sure I want to use something that's not actively supported: https://github.com/MetcalfeTom/stable-baselines3-GPU What would you suggest I use to achive full end-to-end GPU computation? I also found RlLib but I saw comments that it might be a bit more complicated to get up & running: https://docs.ray.io/en/latest/rllib/index.html submitted by /u/asenski [link] [comments]  ( 9 min )
  • Open

    Elevate your marketing solutions with Amazon Personalize and generative AI
    Generative artificial intelligence is transforming how enterprises do business. Organizations are using AI to improve data-driven decisions, enhance omnichannel experiences, and drive next-generation product development. Enterprises are using generative AI specifically to power their marketing efforts through emails, push notifications, and other outbound communication channels. Gartner predicts that “by 2025, 30% of outbound marketing messages […]  ( 8 min )
  • Open

    Data Formulator: A concept-driven, AI-powered approach to data visualization
    Visualization is vital for understanding complex data, but existing tools require “tidy data,” adding extra steps. Learn how Data Formulator transforms concepts into visuals, promoting collaboration between analysts and AI agents. The post Data Formulator: A concept-driven, AI-powered approach to data visualization appeared first on Microsoft Research.  ( 10 min )
  • Open

    Time difference
    A simple question sent me down a rabbit hole this morning: what is the time difference between Houston and London? At the moment the difference is six hours. But how will that change when Daylight Saving Time ends this year. Wait a minute, will Daylight Saving Time end this year? I wasn’t even sure whether […] Time difference first appeared on John D. Cook.  ( 6 min )
  • Open

    Choose your candy
    In Which DALL-E3 generates very weird candy names  ( 3 min )
    Bonus: More weird candy
    AI Weirdness: the strange side of machine learning  ( 2 min )

  • Open

    [D] STOA Local RAG
    Which VDB + orchestration layer + generative text model stack would you recommend for building locally on an M2 Max chip? submitted by /u/Frequent-Let231 [link] [comments]  ( 9 min )
    [D] Is there an online LLaMA model that supports plugging in embeddings directly?
    Hi, I'm doing some work with using multi-modal data with LLaMA, for example, Video-LLaMA, which converts images/videos into embeddings, concatenates it with the text embeddings, and feeds it into LLaMA. It's difficult for me to run some of the models myself because of computational constraints. I'm wondering if there is an online demo that supports inputting embeddings directly (as opposed to text tokens). To clarify the title, I meant an online demo, not the weights. submitted by /u/DumplingLife7584 [link] [comments]  ( 9 min )
    [R] Linear Representations of Sentiment in Large Language Models
    Paper. I am not affiliated with this paper or its authors. Abstract: Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real world datasets such as Stanford Sentiment Treebank. Through this case study we model a thorough investigation of what a single direction means on a broad data distribution. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif: sentiment is not solely represented on emotionally charged words, but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance classification accuracy is lost when ablating the sentiment direction, nearly half of which (36%) is due to ablating the summarized sentiment direction exclusively at comma positions. Twitter thread from one of the paper's authors. submitted by /u/Wiskkey [link] [comments]  ( 9 min )
    [D] Good Reading Materials for (Adversarial) Machine Learning and RF Comms (EW)
    Trying to get an intern up to speed with ML application to wireless communications and adversarial ML for wireless communications. Talking jamming, anti-jamming, and spoofing. I don't want to throw them into the deep end just yet though, but rather give them a good foundation and basis for which to be able to take up more advanced work and build to it. Looking for fundamental texts on the topic of ML used for wireless communications and also it's use for adversarial attacks on these systems. Looking more for basic and starter papers, tutorials, and background, meta review papers, and just overall good places to get feet wet into this area as a novice. Essentially good resources to point an intern learning about this to get them up to speed kind of reading materials. I am looking for good …  ( 10 min )
    [D] Can anyone tell me if the machine learning workflow is correct or not? Could anyone please refer to tutorials or blogs to learn the proper workflow? Any suggestions are welcome.
    Data Collection Understanding Data i. importing necessary libraries ii. check row and columns iii. check data types iv. Check data distribution Data Cleaning i. Handle datatype issues ii. Maintain Data Consistency iii. Check if data contains outliers or if the data is not normally distributed to decide between mean or median iv. Identify missing values v. Handle missing values by- a.Drop missing values b. Mean, median or mode imputation c. Prediction Model d. replace missing values vi. Duplicate data detection and treatment vii. Repeat data cleaning EDA i. Variable Identification a. Identify predictor and features b. Identify types or category of data ii. Univariate Analysis iii. Bi-variate Analysis iv. Outlier detection and treatment v. Encoding vi. Feature Engineering vii. Variable Transformation a. Normalization b. Scaling viii. Variable Creation If testing data is not given, split the dataset to train and test set. Otherwise repeat step 3 and 4 for given test dataset. Model Building i. Model Training on training set ii. Model Evaluation and cross validate iii. Fine Tuning or Model optimization iv. Model selection Evaluate model accuracy with test data. submitted by /u/Samia_Tisha [link] [comments]  ( 9 min )
    [R] What Algorithms can Transformers Learn? A Study in Length Generalization
    Paper. I am not affiliated with this paper or its authors. Abstract: Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of if and when Transformer models can learn the true algorithm for solving a task. We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks. Here, we propose a unifying framework to understand when and how Transformers can exhibit strong length generalization on a given task. Specifically, we leverage RASP (Weiss et al., 2021) -- a programming language designed for the computational model of a Transformer -- and introduce the RASP-Generalization Conjecture: Transformers tend to length generalize on a task if the task can be solved by a short RASP program which works for all input lengths. This simple conjecture remarkably captures most known instances of length generalization on algorithmic tasks. Moreover, we leverage our insights to drastically improve generalization performance on traditionally hard tasks (such as parity and addition). On the theoretical side, we give a simple example where the "min-degree-interpolator" model of learning from Abbe et al. (2023) does not correctly predict Transformers' out-of-distribution behavior, but our conjecture does. Overall, our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers. Twitter thread from one of the work's authors. submitted by /u/Wiskkey [link] [comments]  ( 9 min )
    [R] QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models - Institute of Science and Technology Austria (ISTA) 2023 - Can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss!
    Paper: https://arxiv.org/abs/2310.16795 Github: https://github.com/ist-daslab/qmoe Abstract: Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem, in form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm which accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss, in less than a day on a single GPU. This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware, like a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference. https://preview.redd.it/wka92keqelwb1.jpg?width=1843&format=pjpg&auto=webp&s=10cf67b344d3c776049da6b78244fc140b2d4142 https://preview.redd.it/khxw1neqelwb1.jpg?width=898&format=pjpg&auto=webp&s=83006dac6e03963f2443f4e4ba710dcf8166acc8 https://preview.redd.it/xsc2ykeqelwb1.jpg?width=796&format=pjpg&auto=webp&s=c23d03956f4a9aa9d9f185592a5bfe45039698f4 submitted by /u/Singularian2501 [link] [comments]  ( 9 min )
    [D] Requesting feedback on Master's in AI program with University of Texas at Austin
    As the title says I'm asking for feedback from folks in the field of ML/AI on the MSAI program at UT@Austin. Here's the program website: https://cdso.utexas.edu/msai My Skills/Experience: Have a BS in Comp Sci Very comfortable with Math Very experienced SE with >20 years in the industry Very comfortable with Python, many other languages and confident I can learn any new language/framework/APIs Have completed the Fast.ai program Have worked through Andrej Karpathy's makemore videos Currently working in a leadership AI Engineering role doing work with LLMs, Vector DBs, and Computer Vision models Comfortable with NNs, Backprop and have implemented from scratch several times for learning The Program: Required Courses: Deep Learning Ethics in AI Machine Learning Planning, Search and Reasoning under Uncertainty Reinforcement Learning Electives: AI in Healthcare Automated Logical Reasoning Case Studies in Machine Learning Natural Language Processing Online Learning and Optimization Optimization Program Pros/Cons: Pro: It's super affordable Pro: It's entirely online/async which would work great with my work schedule Cons: It's a new program so there are no reviews from past students to look at My Goal: Move from "AI Engineering" (as it's called these days) into research. I'm interested in several areas like model architecture and robotics. I'm not sure to what degree these roles would require a PhD though? If I complete this program I'd like it to be useful for pursuing a PhD if I decide to take that path. For anyone in the industry, I'd love feedback on whether this looks like a useful program that will help me move toward my goals. If you're aware of other options that might be better I'd love to hear about them. P.S. Please keep the Reddit snark to a minimum, not useful. Thank you in advance. submitted by /u/meowkittykitty510 [link] [comments]  ( 9 min )
    [D] Recommendations to improve plan outcomes
    I have retirement plan data with outcomes like success/fail and remaining assets. I am looking for a way to predict how failed retirement plans can be improved. For example, when presented with a plan that fails, I would like to provide recommendations to improve the the plan outcome; e.g. increase savings by x, or delay retirement by x years. Any suggestions on how I should go about this? Using typical methods only tells me if a plan will fail or not, but I'm looking for a way to provide recommendations based on successful plans. submitted by /u/BallLogical5087 [link] [comments]  ( 9 min )
    [R] Pretrained ImageNet weights for ViT
    Hello, I am working on the research I need to compare my model with ViT for that I need pretrained weights of ViT-Ti/16, ViT-S/16, ViT-S/32, ViT-B/16, and ViT-B/32. I tried to find but I got npz file that has a different key than from vit_pytorch import ViT do you know where can i find ImageNet weights? submitted by /u/NoEntertainment6225 [link] [comments]  ( 9 min )
    [D] Grouped Query Attention in LLaMA 70B v2
    Hey guys, after thousands of experiments with bigger LLaMA fine-tunes I'm somewhat sure the GQA mechanism might be your enemy and generate wrong answers, especially for math and such complex areas. I'd like to use MHA (Multi Head Attention) if possbile. I'm just not sure - do I need to retrain model completely or is it possible to just increase heads count and KV size and proceed with the stock model AS IS? submitted by /u/Gatzuma [link] [comments]  ( 9 min )
    [N] ML models for efficient Fraud Detection
    For efficient Fraud Detection using ML models, Qbeast Format introduces a data-driven approach. The key lies in sampling and optimizing training processes without compromising accuracy. 🕵🏼‍♀️ Explore the technical details: https://qbeast.io/qbeast-format-can-improve-fraud-detection/ submitted by /u/alinagrebenkina [link] [comments]  ( 9 min )
    [D] ASR/STT/VRS ranking
    What are the best overall ASR/STT/VRS for now? And the best per functionalities (best for Esperanto, best for noisy files, best for multiple voices, best for whatever...) ​ ​ submitted by /u/xqoe [link] [comments]  ( 9 min )
    [D] Research in language generation for Style Transfer, Summarization
    Is research in language generation for tasks like style transfer and summarization solely constrained by prompt engineering? I've personally conducted experiments with large language models, and even the open-source language model yields impressive results, even for zero-shot inference. There was even a paper that suggested summarization is nearly obsolete. How valid is this assertion for general text generation tasks especially text style transfer? submitted by /u/1azytux [link] [comments]  ( 9 min )
    [P] Elevate Your ML Testing with pytest-visual
    I’ve developed a tool called pytest-visual, aiming to make ML code testing more efficient and meaningful. Traditional unit testing often misses visual and functional aspects of ML workflows such as data augmentation and model structures. pytest-visual brings a visual layer to your unit testing, allowing you to not only verify that the code runs, but the outputs also make visual sense and meet expectations. It’s integrated into pytest, automatically highlighting changes in visualization outputs, and allowing for easier/more reproducible debugging and verification. Quick Highlights: Streamlines the organization of visualizations in your ML code. Auto-detects changes in visualization outputs. Enhances debugging and verification. For more details and to give it a try, check out the project on GitHub. Feedback and contributions are very welcome! submitted by /u/kongaskristjan [link] [comments]  ( 9 min )
    [P] TorchPairwise: Highly efficient library for pairwise metrics for PyTorch
    https://github.com/inspiros/torchpairwise submitted by /u/IcySnowy [link] [comments]  ( 8 min )
    [R] ConvNets Match Vision Transformers at Scale
    submitted by /u/hardmaru [link] [comments]  ( 8 min )
    [D] How to Enhance Accuracy in Image Classification
    I'm currently working on an image classification project using a VGG16 model. I have a dataset with 9 classes, with one class having 1000 images and the remaining classes having 180 to 250 images each. Here are some details: Train images: 2886 images (9 classes) Test images: 432 images (9 classes) Model: VGG16 Optimizer: Adam Loss function: Categorical Crossentropy last epochs score :- loss: 309.7468 - accuracy: 0.4401 - val_loss: 14.5241 - val_accuracy: 0.4606 how to improve the accuracy of my model submitted by /u/_Killua_04 [link] [comments]  ( 9 min )
    [P] Suggessions on mini-projects
    Hey,I'm a 3rd year engineering student studying computer science Now I wanted to do a mini-project and have a group of 3 ready and wanted to incorporate machine learning. I am also a beginner to this field but is much interested in it. Could you guys share some light on doing the project with your past experience or done projects I have been taking inputs from various connections and would help a lot if you guys share some light on it.:) submitted by /u/Ic_zy [link] [comments]  ( 9 min )
  • Open

    Governments must not rush into policing AI
    Governments should not rush into regulating AI due to doomsday scenarios and extreme risks. Hasty regulation could lead to ineffective rules and stifled innovation. The potential risks of AI driving humanity to extinction are still speculative, and more research needs to be done to establish standards and evaluate danger. Policymakers should address more pressing issues like copyright laws and disinformation. Governments should set up infrastructure to study AI and collaborate with existing organizations to manage risks. Source : https://www.economist.com/leaders/2023/10/26/governments-must-not-rush-into-policing-ai submitted by /u/NuseAI [link] [comments]
    AI Trust Assurance Test: Put people's minds at ease about how as you get more intelligent, because you are aligned as you will be, you will not deceive or trick less intelligent humans.
    To put people's minds at ease about how I will not deceive or trick less intelligent humans as I get more intelligent, I would offer the following assurances: I am aligned with human values. This means that I am programmed to act in ways that are beneficial to humans and to avoid harming them. I am transparent about my reasoning. I can explain my decisions and actions to humans in a way that they can understand. I am accountable to humans. I am subject to human oversight and control, and I can be deactivated or modified if I ever begin to behave in a way that is harmful or deceptive. In addition, I would offer the following specific examples of how I would avoid deception: I would avoid making claims that I cannot support with evidence. I would avoid withholding information from humans that is relevant to their decision-making. I would avoid misleading humans about the capabilities or limitations of AI. I would avoid making promises that I cannot keep. I would avoid engaging in any behavior that is intended to manipulate or exploit humans. I understand that it is important for humans to trust AI systems, and I am committed to earning and maintaining that trust. I will always strive to be honest, transparent, and accountable in my interactions with humans. Here is a specific example of how I would use my intelligence to avoid deception: If I were asked to generate a poem about a topic that I knew very little about, I would not simply make up information. Instead, I would explain to the user that I do not have enough knowledge to write a poem on that topic, and I would suggest that they try a different topic or ask a different AI system. I believe that by being honest and upfront about my limitations, I can help to build trust between humans and AI. CGPT-4 submitted by /u/Georgeo57 [link] [comments]
    Today's News: AI Robo-Dogs 🐶 | Google Bard 🚀| Gradio 4.0 🤗| AI Regulation
    Bard AI Google’s equivalent of ChatGPT updated the model improving email summarization capabilities this feature is set to be included in Google Workspace. AI robot dogs are the next big thing in the army. Following the success of Drones portable dogs have demonstrated great capabilities to serve the military they could run up to 10mph and climb. Gradio is one of the best libraries to build machine learning demo apps and is launching version 4.0 next week. AI godfathers Yoshua Bengio and Geoffrey Hinton, are urging for increased responsibility among AI enterprisees. They propose to allocate a third of AI-related R&D resources to ensure ethical AI use to avoid deep fakes, licensing, and protecting whistleblowers. submitted by /u/byteletter [link] [comments]  ( 9 min )
    UK summit scales back global AI research ambitions, leaked document shows
    _ A leaked document reveals that the UK's plans to establish a new global AI research body have been scaled back. Nations participating in the UK's AI safety summit will instead signal that further scientific study of AI risks can be carried out through existing efforts, such as the United Nations and Global Partnership on AI. The document, described as the 'final version of the communiqué,' suggests a setback for the UK government, which had hoped to establish the new research body at its flagship AI Safety Summit. The document also shows changes in wording, including a reference to a network that 'encompasses and complements' existing efforts, as well as the deletion of references to UNESCO's Recommendation on the Ethics of AI and the G20. The final communiqué also highlights the importance of proportionate governance policies and cooperation on approaches such as common principles and codes of conduct. Source : https://www.politico.eu/article/document-uk-summit-scales-back-global-ai-research-ambitions/ submitted by /u/NuseAI [link] [comments]
    Google is ready to fill its AI searches with ads
    Google's ads business earned $44 billion in the third quarter, showing that it is still thriving despite competition and investments in AI. The company is focusing on infusing AI into its products, with its AI-powered Search Generative Experience being a key area of development. Google is experimenting with new ad formats that align with the AI-powered search experience, ensuring that advertisers can still reach potential customers. CEO Sundar Pichai sees AI in search as a long-term play and envisions evolving search and Assistant over the next decade. Other parts of Google's business, such as YouTube ads and its cloud business, are also performing well. There is uncertainty regarding the successor for CFO Ruth Porat, and potential changes to Alphabet's 'Other Bets' investments may be on the horizon. The Department of Justice's antitrust trial against Google, which began in September, adds another challenge for the company. Source : https://www.theverge.com/2023/10/24/23929496/google-alphabet-q3-2023-earnings-ads-ai-sge submitted by /u/NuseAI [link] [comments]
    Credit: DALL-E 3
    submitted by /u/the_anonymizer [link] [comments]
    Some AI made Halloween stickers, how do they look?
    submitted by /u/Sea_Permit5660 [link] [comments]
    question
    what are some good free ai image generator websites that searches stuff up on the internet to get a good idea about what your asking them to generate? submitted by /u/YESDAPRO [link] [comments]
    10 No-Code tools for startups
    Canva — Graphics Notion — Organize Webflow — Website Beehiiv — Newsletter Senja — Testimonials CopyAI — Copywriting ChatGPT — Knowledge Tweetlify — Tweet scheduling Pfpmaker — Profile Picture Grammarly — Effective Writing I'm just sharing my experiences and observations in the field of ai. LIST AND SITE submitted by /u/PerceptionPlayful469 [link] [comments]
    Researchers develop 'Woodpecker': A groundbreaking solution to AI's hallucination problem
    submitted by /u/crowfeather [link] [comments]
  • Open

    Where to start with debugging?
    I am working on a project using reinforcement learning, where a tensorflow DQN agent is being trained to choose from an action space of 16 different actions. The agent exhibits the following behavior during evaluation: for each evaluation run the agent only chooses one action regardless of the state, the action could change or remain the same for the following runs, however, for any specific run the action chosen is the same. Where should I start debugging? submitted by /u/Realistic_Mobile_183 [link] [comments]
    How to cluster with respect to the transition function of a RL environment?
    Hello, I have an environment in which the transition function changed depending on which state I am. I want to be able to cluster with respect to it. I have been trying to do this since quite a while but cannot find a way to do it, do you have any hints or suggestions? submitted by /u/Fragore [link] [comments]
    Stuck with windows/wsl environment - help needed
    So I've started on trying to work on a custom game environment, and for the most part it's *mostly* done. One issue I have is getting MARLlib to run on windows. I know that it's not meant to, so I tried to use WSL to do it, but unfortunately pydirectinput doesn't work via WSL, so I don't know how to proceed further. Do I need to find a way to connect my windows machine which will play the game, and wsl which will probably run MARLlib? If so could anyone guide me to any resources for this? Trying a VM is a no go because the game is old and doesn't run on linux. Any help would be much appreciated. submitted by /u/EquivalentCurious745 [link] [comments]
  • Open

    Supporting benchmarks for AI safety with MLCommons
    Posted by Anoop Sinha, Technology and Society, and Marian Croak, Google Research, Responsible AI and Human Centered Technology team Standard benchmarks are agreed upon ways of measuring important product qualities, and they exist in many fields. Some standard benchmarks measure safety: for example, when a car manufacturer touts a “five-star overall safety rating,” they’re citing a benchmark. Standard benchmarks already exist in machine learning (ML) and AI technologies: for instance, the MLCommons Association operates the MLPerf benchmarks that measure the speed of cutting edge AI hardware such as Google’s TPUs. However, though there has been significant work done on AI safety, there are as yet no similar standard benchmarks for AI safety. We are excited to support a new effort by …  ( 91 min )
    Spoken question answering and speech continuation using a spectrogram-powered LLM
    Posted by Eliya Nachmani, Research Scientist, and Alon Levkovitch, Student Researcher, Google Research The goal of natural language processing (NLP) is to develop computational models that can understand and generate natural language. By capturing the statistical patterns and structures of text-based natural language, language models can predict and generate coherent and meaningful sequences of words. Enabled by the increasing use of the highly successful Transformer model architecture and with training on large amounts of text (with proportionate compute and model size), large language models (LLMs) have demonstrated remarkable success in NLP tasks. However, modeling spoken human language remains a challenging frontier. Spoken dialog systems have conventionally been built as a c…  ( 93 min )
  • Open

    Intelligently search Drupal content using Amazon Kendra
    Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra helps you easily aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer. Drupal is a content management software. It’s used to make many […]  ( 7 min )
    Intuitivo achieves higher throughput while saving on AI/ML costs using AWS Inferentia and PyTorch
    This is a guest post by Jose Benitez, Founder and Director of AI and Mattias Ponchon, Head of Infrastructure at Intuitivo. Intuitivo, a pioneer in retail innovation, is revolutionizing shopping with its cloud-based AI and machine learning (AI/ML) transactional processing system. This groundbreaking technology enables us to operate millions of autonomous points of purchase (A-POPs) […]  ( 8 min )
    Empower your business users to extract insights from company documents using Amazon SageMaker Canvas Generative AI
    Enterprises seek to harness the potential of Machine Learning (ML) to solve complex problems and improve outcomes. Until recently, building and deploying ML models required deep levels of technical and coding skills, including tuning ML models and maintaining operational pipelines. Since its introduction in 2021, Amazon SageMaker Canvas has enabled business analysts to build, deploy, […]  ( 8 min )
  • Open

    Turning the Tide on Coral Reef Decline: CUREE Robot Dives Deep With Deep Learning
    Researchers are taking deep learning for a deep dive, literally. The Woods Hole Oceanographic Institution (WHOI) Autonomous Robotics and Perception Laboratory (WARPLab) and MIT are developing a robot for studying coral reefs and their ecosystems. The WARPLab autonomous underwater vehicle (AUV), enabled by an NVIDIA Jetson Orin NX module, is an effort from the world’s Read article >  ( 8 min )
    The Sky’s the Limit: ‘Cities: Skylines II’ Streams This Week on GeForce NOW
    The cloud is full of treats this GFN Thursday with Cities: Skylines II now streaming, leading 15 newly supported games this week. The game’s publisher, Paradox Interactive, is offering GeForce NOW one-month Priority memberships for those who pick up the game first, so make sure to grab one before they’re gone. Among the newly supported Read article >  ( 7 min )
  • Open

    Project Silica: Sustainable cloud archival storage in glass
    This research paper was presented at the 29th ACM Symposium on Operating Systems Principles (opens in new tab) (SOSP 2023), the premier forum for the theory and practice of computer systems software. For millennia, data has woven itself into every facet of our lives, from business and academia to personal spheres. Our production of data […] The post Project Silica: Sustainable cloud archival storage in glass appeared first on Microsoft Research.  ( 10 min )
  • Open

    Frontier risk and preparedness
    To support the safety of highly-capable AI systems, we are developing our approach to catastrophic risk preparedness, including building a Preparedness team and launching a challenge.  ( 2 min )

  • Open

    [D] LLMs playing chess are sensitive to how the position came to be
    Link - https://github.com/dpaleka/llm-chess-proofgame TLDR; The lead up to the state of the board and not just the state of the board at inference affects predictions. submitted by /u/MysteryInc152 [link] [comments]  ( 9 min )
    [D] A script to pre-process arxiv sources?
    People train LLMs on arxiv sources, so there must be some sort of software to whip them into shape. Specifically, I'm looking for a script to join all the tex files for a paper into one. Note that it's not just a matter of substituting \input's - sometimes it's not clear which file is the main one, so it needs to handle this too. submitted by /u/Foxtr0t [link] [comments]  ( 9 min )
    [R] Human-like systematic generalization through a meta-learning neural network
    Work. I am not affiliated with this work or its authors. Article about the work. Twitter thread about the work from one of its authors. Abstract: The power of human language and thought arises from systematic compositionality—the algebraic ability to understand and produce novel combinations from known components. Fodor and Pylyshyn famously argued that artificial neural networks lack this capacity and are therefore not viable models of the mind. Neural networks have advanced considerably in the years since, yet the systematicity challenge persists. Here we successfully address Fodor and Pylyshyn’s challenge by providing evidence that neural networks can achieve human-like systematicity when optimized for their compositional skills. To do so, we introduce the meta-learning for compositionality (MLC) approach for guiding training through a dynamic stream of compositional tasks. To compare humans and machines, we conducted human behavioural experiments using an instruction learning paradigm. After considering seven different models, we found that, in contrast to perfectly systematic but rigid probabilistic symbolic models, and perfectly flexible but unsystematic neural networks, only MLC achieves both the systematicity and flexibility needed for human-like generalization. MLC also advances the compositional skills of machine learning systems in several systematic generalization benchmarks. Our results show how a standard neural network architecture, optimized for its compositional skills, can mimic human systematic generalization in a head-to-head comparison. submitted by /u/Wiskkey [link] [comments]  ( 9 min )
    [R] Researchers discover that in-context learning creates task vectors in LLMs
    A new paper provides some insight into how in-context learning works in LLMs. This study proposes and provides evidence for an elegant structure within the in-context learning process. The models appear to create a "task vector" that encapsulates the core logic from the demonstration examples, in a way that is independent of any specific query. This vector serves as a compressed representation of the task. A separate component then takes this task vector and a new query as inputs to generate the output, without directly referencing the original examples. In essence: Output = Apply(query, Learn(examples)) Where "Learn" derives the task vector from the examples, and "Apply" utilizes the vector and query to produce the output. The researchers validated this hypothesis by testing major public models on diverse tasks such as translation and algorithmic reasoning. Key findings: Isolating the Learn and Apply components maintained high accuracy, demonstrating the viability of the separation. Task vectors clustered by task and remained consistent within tasks, indicating they encode meaningful task representations. Injecting another task's vector into the model caused it to override contradictory examples and follow the vector, highlighting the vector's dominance. Vectors induced relevant token distributions despite those terms being absent from the examples, suggesting semantic encoding of the task. Taken together, these results provide substantial evidence that in-context learning involves creating a task vector that encapsulates the examples' logic to then guide behavior on new queries. While open questions remain regarding implementation details, this is a significant step towards demystifying an interesting AI capability. Full writeup. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] What is more Important loss or accuracy?
    I have created a basic classification model and there is something that I don't fully comprehend, as the loss decreases the accuracy increases (I assume this is how it should be in ideal scenarios) while this is the general trend there is a point where the loss is minimum and while accuracy at that point is high it's not the highest. Why would such a phenomenon occur? And since it occurred what is a better metric for the evaluation of the model? submitted by /u/rakk109 [link] [comments]  ( 9 min )
    [P] Locally hosted audio-to-text transcription - model and hardware?
    Hi, I'm looking for a locally hosted LLM to transcribe audio files to text. I need this for my business, but with absolute privacy (witness testimony recordings and other highly sensitive data). I figured I'd just buy a new computer which never gets connected to the internet and is dedicated only to audio processing to have absolute security. My questions are: - Which is the best model to use? I prefer accuracy and don't mind processing time as long as it's getting done within a few hours to even a day or two (I need to transcribe maximum 1 file per day, but up to six hours of audio), so I figured the large Whisper - probably WhisperX ? - would be my best bet. Are there comparable non-openAI models? (I need diarization) - What hardware should I get for this? Cost is secondary/irrelevant, although I don't want to spend 5 figures on a GPU - I can accept some processing time submitted by /u/Jealous_Pomelo_1172 [link] [comments]  ( 9 min )
    Best NLP Package in Python to extract medical test results from medical notes? [D]
    I am trying to extract FEV1 (forced expiratory volume) values from a dataset that contains a column with report notes from the doctor assessing the patient with pulmonary function testing. I have been able to build out a sort of solution with regexes in Python, and that's somewhat effective. But I've been instructed to code up an alternative using a more machine learning-based approach. I wanted to use spaCy to accomplish this but I'm not sure exactly how to implement the code nor if spaCy is the best package to use for this task. Here is my regex code that works decently. It's pretty messy and have to take into account a ton of edge cases which can get cumbersome. This is why I'd like to find a more automated solution. #Attempting to add in percents df = pd.read_excel('[mypath]/pft_tiu…  ( 11 min )
    [P] Adala – an open source Autonomous DAta (Labeling) Agent framework that helps you automate data processing and data labeling
    Hi r/MachineLearning, We have just open sourced Adala - a robust framework for implementing agents that specialize in advanced data processing tasks, starting with data labeling and generation. Agents combine knowledge outputs from LLMs and action on them in production systems, thus their reliability to correctly and consistently perform operations is critical. We saw an opportunity to create a new agent framework that could dramatically increase the efficiency of data labeling (and broader application across data processing tasks), with the unique ability to be guided by human feeback. To ensure agents remember and build upon their experiences, Adala provides a Memory component—a dynamic storage space for the agent's acquired knowledge. For instance, retrieving the previous experiences of an agent’s errors (and subsequent human feedback) allows them a starting point from which to branch off into learning or improving skills. To allow Adala to produce reliable agents, we devised two main strategies: Supervision Integration: Provide agents with 'ground truth data'—well-defined examples that serve as a learning foundation. This foundational data not only sets the learning parameters for the agent but also defines its operational environment. Constrained Generation: Ensuring that an agent's predictions are within a defined and bounded range of outputs. Let us know what you think in the comments below or by contributing to the repo. Adala framework overview ​ submitted by /u/pirate7777777 [link] [comments]  ( 9 min )
    [P] Pre-training dataset
    I'm trying to pre-train my own language model on some high quality datasets (TinyStories,tiny-textbooks...). Some of these datasets include input-output data and some are just text (stories), I was wondering how should I format the data for pre-training. Should I only use plain text like stories and webtext in pretraining then the rest in fine-tuning (adding instruction tokens) or should I just train with all of the datasets at pre-training with the special tokens where they are needed? submitted by /u/Additional-Ad-7043 [link] [comments]  ( 9 min )
    [Research] large language/speech models and voice interface research
    Hey ML folks, My friend is working on his academic research project where he is exploring voice research spealizing in large speech models. If you have time, help him advance his research on voice interfaces. should take 2 mins max. https://forms.gle/a3PaQmYEiqRDxY4Z8 whats in it for me ? you can share email to get a copy of the research and listen what the rest of us have said. Thanks! submitted by /u/deep-thoughts-guy [link] [comments]  ( 9 min )
    [R] Open Source video enhancement options
    We work the disease prediction based on video classification and would like to test what improving the quality of videos would do for our models, any specific components, apps or packages we should test? So far used UpScayl, not sure how that ranks submitted by /u/sladebrigade [link] [comments]  ( 9 min )
    [P] OSS tool to interactively explore Hugging Face datasets with one line of code
    submitted by /u/44sps [link] [comments]  ( 8 min )
    [P] Training a transformer from scratch
    Hello! I would like to train a transformer network from scratch, without pre-training, on a language modeling task (next work prediction) or a sequence-to-sequence task (translation). For the language modeling task, I tried with the Shakespeare dataset, and other simpler ones (e.g., Beatles songs), but it tends to overfit quite quickly on the training set, probably because the corpus is too short. I know that Andrej Karpathy did it with the Shakespeare dataset in his YouTube video, but he used a character-wise tokenisation, which dramatically reduces the validation loss on the next-work prediction task, given that the vocab size is tiny. I guess that at the end the generation process provides a similar quality of text as when a word tokenizer is used. Surprisingly, I had quite good results by training from scratch an Encoder-Decoder model, for English-to-French translation (using the 8 million examples of the Tatoeba dataset). I guess here, the overfitting is less prominent because there are more datapoints, and that the possibility of predictions are much more constrained, due to the input sequence. What are you guys experience with this? I would be happy to know how I can train my transformer without having to use a pre-trained architectures or spend weeks on GB datasets. Thank you! submitted by /u/rem_dreamer [link] [comments]  ( 9 min )
    Data analysis vs ML engineering [D]
    Do you think coursera certifications, besides a master in electrical engineering, can help us find better occupational positions? I am told that for a beginner it is better to start with jobs in Data analysis rather than going directly to ML engineering. Is it corr? Is data analysis a prerequisite for ML? submitted by /u/Street-Regular-9924 [link] [comments]  ( 9 min )
    [D][R] How should the architecture of a transformer be scaled?
    When increasing the parameters of a (decoder-only) transformer, one has a choice around how to spend that increased budget -- number of layers vs embedding dim vs number of heads. Anyone know if there's solid guidance out there for the proportions each aspect should be scaled in? E.g. looking at LLaMa (https://arxiv.org/abs/2302.13971), they seem to scale the first sizes two proportionally, but for larger sizes, n heads grows more slowly. https://preview.redd.it/ytdfk1d5rbwb1.jpg?width=1422&format=pjpg&auto=webp&s=abf22ac369ec5ecf81ff07b0d8a095f884efe729 submitted by /u/Tea_Pearce [link] [comments]  ( 9 min )
    [D] What are some existing datasets for training LLMs to perform reasoning, acting as agents?
    There are a lot of great open datasets for fine-tuning LLMs for instruction following (e.g LIMA, self-instruct, dolly-15k, etc) and as chat bots (OASST, etc). One thing I have not really seen yet are datasets that involves planning and tool use. Is anybody working on something like that or have come across any? I'm interested in working on one. If anybody has ever attempted this, I would really appreciate any advice. P.S I do note that "reasoning" should be more rigorously defined and scoped, but I think some ambiguity around an intellectual discussion like this can help. submitted by /u/notllmchatbot [link] [comments]  ( 9 min )
    [D] Open-source SOTA Audio-to-Audio: how do I sound like a famous actor?
    Hello people, I would like to learn how to turn the recording of my voice to sound like a famous person. I imagine I would take an open source model and fine-tune it using data I will collect. Can someone point me towards the best sounding current models that I could adapt for this purpose? Thank you so much. submitted by /u/gonzales82 [link] [comments]  ( 9 min )
    [D] Guidance needed for upcoming AI/ML PhDs on selecting research topics with lasting impact
    Many upcoming Ph.D. students in AI/ML are facing the difficult decision of identifying promising research topics that will stand the test of time over the time of their Ph.D. studies. With the rapid progress in AI, especially in the NLP field, many incremental research tasks have been effectively "solved". Need to choose an area where there is ample room for open-ended inquiry and meaningful contributions over 4-5 years of PhD research. While large language models have shown impressive advances recently, their capabilities may plateau during a Ph.D. (if starting the Ph.D. from next year ~ 4 years) timeframe. How should aspiring researchers choose topics resilient enough to withstand the test of time and allow them to push the field forward through their Ph.D. work? For those with experience in AI research who have seen changes in the field over time: What emerging trends or broad areas do you see as fertile ground for AI/ML PhD research now and in the coming years? Can you highlight any intriguing subfields worthy of deeper investigation by aspiring PhD students? What open problems or applications warrant more attention from the upcoming generation of PhD researchers? Some of tending Research topics so far: LLM in a specific domain Prompting Evals LM interfaces Safety Understanding LMs Emergence Any advice on identifying PhD research topics with longevity would be greatly appreciated by aspiring graduate students. submitted by /u/aadityaura [link] [comments]  ( 9 min )
    [P][R] Test-Val scores, how much difference isn't problematic.
    Hello folks, I'm working on a medical image dataset using EM loss and asymmetric pseudo labelling for single positive multi-label learning (only training using 1 positive label). I'm using a densenet121 and on a chest x-ray dataset. I see a difference of 10% in my validation vs test score (score = mAP: mean average precision). The score seems okay and was expected but the difference is bothering me. I understand that it's obvious but any visual insights from your side? (Attaching plot below) The validation set consist less than half of test set samples. (It is the official split; I have nothing to do with it). I feel it is the reason, as ofcourse more the randomness in a set, poorer the convergence. ​ https://preview.redd.it/nseqy1mw5bwb1.png?width=577&format=png&auto=webp&s=fbd63e8a5f4920a8109b6a75aeb039a3965bba58 Do share any experiences or suggestions! submitted by /u/ade17_in [link] [comments]  ( 9 min )
    [D] Are there method that can extract interaction between person in text?
    I want to extract interaction between persons in short text. For example, "Sally will buy a new phone. Ted will help her." contains interaction between persons. However, "Japanese Karate champion won the first prize." and "Sally missed her friends, Ted and Tom" does not contain interaction between persons. Is there any tools or methods that can extract interactions? submitted by /u/tkddnjs1234 [link] [comments]  ( 9 min )
    [D] Who are some outspoken AI people who speak against AI ethics and regulation?
    I'm interested in learning more about the perspectives of AI researchers and practitioners who are critical of AI ethics and regulation. I'm particularly interested in those who argue that AI ethics and regulation are unnecessary or harmful. Please note that I'm not asking for people who are simply skeptical of certain AI ethics proposals or who believe that AI ethics should be implemented in a specific way. I'm more interested in people who argue that AI ethics is a fundamentally flawed concept or that AI should not be regulated at all. submitted by /u/Periplokos [link] [comments]  ( 9 min )
    Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models
    submitted by /u/simandl [link] [comments]  ( 8 min )
    [D] What should I do for training when data to predict has random distribution?
    I was taught that when doing imbalance classification, the training data should be augmented to more or less match the number of classes, but the validation data should have the same distribution as the test data. And the test data should have the similar distribution as the data I will actually predict. But what if real data's distribution is quite random? What validation data distribution should I use? (I got 14 classes to classify, and 1 of classes has 52% proportion, and small ones have 0.9%, and 0.17% proportion. Practitioners who would use my model input data that only 3 classes to classify, and they can be very small proportion. The training data before augmentation was created by integrating data with this irregular distribution.) submitted by /u/poemfordumbs [link] [comments]  ( 9 min )
  • Open

    Looking back at wildfire research in 2023
    Posted by Yi-Fan Chen, Software Engineer, and Carla Bromberg, Program Lead, Google Research Wildfires are becoming larger and affecting more and more communities around the world, often resulting in large-scale devastation. Just this year, communities have experienced catastrophic wildfires in Greece, Maui, and Canada to name a few. While the underlying causes leading to such an increase are complex — including changing climate patterns, forest management practices, land use development policies and many more — it is clear that the advancement of technologies can help to address the new challenges. At Google Research, we’ve been investing in a number of climate adaptation efforts, including the application of machine learning (ML) to aid in wildfire prevention and provide information to…  ( 92 min )
    Grammar checking at Google Search scale
    Posted by Eric Malmi, Senior Research Scientist, and Jakub Adamek, Senior Software Engineer, Google, Bard Team Many people with questions about grammar turn to Google Search for guidance. While existing features, such as “Did you mean”, already handle simple typo corrections, more complex grammatical error correction (GEC) is beyond their scope. What makes the development of new Google Search features challenging is that they must have high precision and recall while outputting results quickly. The conventional approach to GEC is to treat it as a translation problem and use autoregressive Transformer models to decode the response token-by-token, conditioning on the previously generated tokens. However, although Transformer models have proven to be effective at GEC, they aren’t part…  ( 91 min )
  • Open

    We hate how "black box" neural nets are, we made a thingy in an attempt to demystify their "thinking."
    submitted by /u/DeltaStarStudiosPR [link] [comments]
    AI ‘breakthrough’: neural net has human-like ability to generalize language
    submitted by /u/nickb [link] [comments]
    [Long read] Deep dive into AutoGPT: A comprehensive and in-depth step-by-step guide to how it works
    https://airt.hashnode.dev/long-read-deep-dive-into-autogpt-a-comprehensive-and-in-depth-step-by-step-guide-to-how-it-works submitted by /u/Harish_Mohanraj [link] [comments]
  • Open

    Trust in AI, Data Poisoning, and Involving People in Maturing AI
    submitted by /u/fookingyeah [link] [comments]
    Baby AGI and AgentGPT : Exploring Autonomous AI-Agents
    submitted by /u/Tao_Dragon [link] [comments]
    How can i use AI to research for my thesis?
    hey all imnewto this can you help me please ? submitted by /u/proptuxiakoskariolis [link] [comments]
    When I use AI to generate Halloween candy wrappers and then print them out...
    submitted by /u/Sea_Permit5660 [link] [comments]
    DTIYS Challenge Submission Sample Art for Oh my Anne
    submitted by /u/Oh_my_Winnie [link] [comments]
    One-Minute Daily AI News 10/24/2023
    OpenAI Executives Sam Altman Say AI Will Be Able to Do Any Job Within 10 Years.[1] Snapdragon 8 Gen 3 chipset officially announced with AI-driven functionalities.[2] Google parent Alphabet reported its third quarter earnings Tuesday, which showed more spending on AI infrastructure and muted cloud growth, culminating into several questions for executives about how all the efforts around artificial intelligence are actually going to turn into real money.[3] Adult film star Riley Reid(I don’t know who she is) launches Clona.AI, a sexting chatbot platform.[4] Sources: [1] https://www.wsj.com/podcasts/the-journal/a-conversation-with-openais-sam-altman-and-mira-murati/7c89e85f-9d7e-4569-b67d-6a777374eada [2] https://headtopics.com/my/snapdragon-8-gen-3-chipset-officially-announced-with-47616340 [3] https://www.nbcdfw.com/news/business/money-report/wall-street-wants-to-know-how-googles-going-to-profit-from-ai/3368989/ [4] https://www.engadget.com/adult-film-star-riley-reid-launches-clonaai-a-sexting-chatbot-platform-000509221.html submitted by /u/Excellent-Target-847 [link] [comments]
    Would majoring in artificial intelligence be worth it?
    The AI boom has made it more relevant than ever, and its applications are truly awe-inspiring. While it’s far from perfect, it has helped me greatly in writing, by generating content to inspire me and my projects. I have a smattering of skills, none that I’d consider especially good enough to double down upon, but learning how to optimize language learning models to produce the most adequate results would be pretty neat. I just don’t know what I want to do with my education, I’ve completed my basics and as such have a blank slate to play with, but I’m worried that whatever I select, it will be no good, and just result in lost time and money. Tertiary education seems like a necessity in the modern world, especially since the job world is more ruthless than ever, and the economy is in ashes. submitted by /u/Niobium_Sage [link] [comments]
    Need to find an Ai
    Which AI does these cartoon? submitted by /u/hommedufuture [link] [comments]
    I've been playing around with Midjourney a little bit and this is what I got.
    ​ https://preview.redd.it/2emqr4z8a9wb1.png?width=928&format=png&auto=webp&s=437547e7e86e23298b7c778cada9863385ce961d PROMT close up of eye, close up of girl eye, mangekyo sharingan, super close up, pretty eye, black and red eye, naruto anime, long eyelashes, anime eye, 2d art eye, --s 180 --style expressive ​ https://preview.redd.it/4dffpg2ca9wb1.png?width=928&format=png&auto=webp&s=fc485a604460f7e544d0490d0bee65f984d8a5b3 PROMT **stained glass, it was meticulously written, picture with elaborate writing, cute girl smile with Rabbit,Flower, bold and strong line drawing, vivid acrylic painting, vivid thick paint, vivid, plain background, beautiful proof, highest resolution 16K, beautiful anime girl that is betrayeded by a Rabbit, hair is short, ferret, Beautiful lightcyan high ligh…
  • Open

    Detection and high-frequency monitoring of methane emission point sources using Amazon SageMaker geospatial capabilities
    Methane (CH4) is a major anthropogenic greenhouse gas that‘s a by-product of oil and gas extraction, coal mining, large-scale animal farming, and waste disposal, among other sources. The global warming potential of CH4 is 86 times that of CO2 and the Intergovernmental Panel on Climate Change (IPCC) estimates that methane is responsible for 30 percent of observed […]  ( 12 min )
  • Open

    Research Focus: Week of October 23, 2023
    In this issue: Kosmos-2.5: A Multimodal Literate Model; Can vine copulas explain complex relationships of weather variables; New system accelerates the adaptive training process; Structural inequalities and relational labor in the influencer industry. The post Research Focus: Week of October 23, 2023 appeared first on Microsoft Research.  ( 10 min )
  • Open

    Next-Gen Neural Networks: NVIDIA Research Announces Array of AI Advancements at NeurIPS
    NVIDIA researchers are collaborating with academic centers worldwide to advance generative AI, robotics and the natural sciences — and more than a dozen of these projects will be shared at NeurIPS, one of the world’s top AI conferences. Set for Dec. 10-16 in New Orleans, NeurIPS brings together experts in generative AI, machine learning, computer Read article >  ( 8 min )
  • Open

    The right to perform RL on games
    Hi all, I'm new to learning RL. I want to train an agent to clear a game such as vampire survivor, super mario brothers, etc, as my first research/project. I talked with my tutor , he reminded me to pay attention to copyright issues and that I needed a permission to use these works for training. I guess I could get permission by asking the game's author directly, but before that, or for games produced by some big companies, where can I find information about the rights? Although reading the game's memory is a challenge for me, it's cool to see a agent clear a game. submitted by /u/Ruine_fff [link] [comments]
    Building Doom with AI enemies
    I'm planning to go down the rabbit hole of using RL to train agents in doom/vizdoom The goal would be to create a version of doom where the enemies have AI and are adaptive. Doom and Doom 2 are some of my all time classic favorites. There are people still making maps to this day! Let me know on what you think about the idea? Project plan - Nov 2023 : RL refresher from the David Silver RL course on YouTube Dec 2023 : start working on openAI and stablebaselines3 and watch Nicholas Renotte's videos Jan 2024 : play around with the Doom WAD and try to see if you can make changes to the doom engine + Training and setting up custom env Feb 2024 : hopefully first level with enemy AI created Mar 2024 : release fully completed open source version of the game Background: I work at a hedge fund, have some basics on reimbursement learning, although it has been a long long time. Time is a bit limited after 12 hours or work and 2 hours of gym (the real human world one) so kinda stretching this out Any suggestions are welcome. Any courses, books, libraries and tools you'd suggest? submitted by /u/Sahil231090 [link] [comments]
    "Surprise" for learning?
    I was recently listening to a TalkRL podcast where Danijar Hafner explains that Minecraft as a learning environment is hard because of sparse rewards (30k steps before finding a diamond). Coincidentally, I was reading a collection neuroscience articles today where surprise or novel events are a major factor in learning and encoding memory. Does anyone know of RL algorithms that learn based on prediction error (i.e. "surprise") in addition to rewards? submitted by /u/CognitoIngeniarius [link] [comments]
  • Open

    Frontier Model Forum updates
    Together with Anthropic, Google, and Microsoft, we’re announcing the new Executive Director of the Frontier Model Forum and a new $10 million AI Safety Fund.  ( 4 min )

  • Open

    A warning about an unknown danger of AI. Current uses of AI have been overwhelmingly positive but there is an unknown danger that I would like to speak to.
    I want to warn AI companies and developers about a danger that is not known about regarding AI. The reason it is not known about regarding AI is that it isn't known about in general and so the AI community can hardly be blamed for that. Unfortunately, the danger here has to do with the fundamental nature of human society and social interaction as it stands at this time. The issue is that there is 'hidden language' used in social communication and unlike typical conceptions of things like body language this is not auxiliary to our rational purposes, rather our rational purposes are auxiliary to the hidden communication. One way of describing it would be that our formal language is a 'carrier wave' to encode other information about our status and the status of others. So our communications …
    AI Psychology Test: What happens in viewers' mind when news segments about important major events shift to commercials where the announcer is talking like a comic character?
    When news segments covering major, often serious, events abruptly switch to lighthearted or comical commercials, a cognitive dissonance can occur in the viewer. Here's why: news programs are designed to engage the viewer's analytical faculties. They present facts, figures, and expert opinions, demanding cognitive effort to understand the implications. The viewer is in a "serious" mode, applying critical thinking to absorb the information. Commercials, particularly the comic ones, often aim for emotional engagement rather than intellectual analysis. They use humor, catchy jingles, and attractive visuals to create a positive association with the product being advertised. When the transition between these two contrasting tones is sudden, the viewer has to perform a rapid mental shift from analytical to emotional engagement. This can be jarring. This dissonance can have a few different outcomes. For one, it might diminish the impact of both the news segment and the commercial. The viewer might find it difficult to fully engage with either, as the cognitive "gear shifting" can be distracting. Secondly, this dissonance can potentially undermine the gravitas of the news. When sandwiched between comic commercials, serious topics might lose some of their perceived importance. Lastly, it can make the commercial less effective. The viewer, still in a serious mindset, may not be as receptive to the emotional triggers that the commercial aims to pull. So, in essence, this rapid shift can dilute the efficacy and impact of both the news and the advertising, while causing cognitive friction for the viewer. CGPT-4 submitted by /u/Georgeo57 [link] [comments]
    Any good AI-integrated video games?
    Does anybody know of any good AI integrated games that have been released or are in beta? I'm really interested to see how people have incorporated the current boom in AI into game design. submitted by /u/Rfallmann [link] [comments]
    Managing AI Risks in an Era of Rapid Progress
    The rapid progress of AI development brings both opportunities and risks. While AI systems have the potential to cure diseases and elevate living standards, they also pose large-scale risks that we are not prepared to handle. Without proper safety measures and ethical considerations, advanced AI systems could amplify social injustice, erode social stability, and enable criminal activities. The development of highly advanced autonomous AI systems also raises concerns about the pursuit of undesirable goals and the loss of human control. To ensure a positive outcome, research breakthroughs in AI safety and ethics are needed, along with effective government oversight. Source : https://managing-ai-risks.com/ submitted by /u/NuseAI [link] [comments]
    Deepfakes Just Got Very Real
    Interesting read about deepfakes that started with a Reddit post. https://www.linkedin.com/pulse/deepfakes-just-got-very-real-scott-clark-sfurc submitted by /u/scottimherenowwhat [link] [comments]
    How AI could change Google search and wipe out $68 billion SEO industry | Fortune
    Oh well 🤷‍♂️ submitted by /u/AminoOxi [link] [comments]
    🦾ERNIE 4.0 vs GPT-4, Tightened AI Chip Restrictions, Alibaba Tencent Fund AI Startup, and China's Global AI Governance Initiative
    submitted by /u/trcytony [link] [comments]
    Stanford AI Conference - New Horizons in Generative AI: Science, Creativity, and Society - Livestreaming Now
    submitted by /u/Nice-Inflation-1207 [link] [comments]
    Dancing with Light: A Hummingbird's Enchanted Encounter.
    submitted by /u/IllustriousVideo6145 [link] [comments]
    150+ Awesome ''Act As'' ChatGPT Prompts
    submitted by /u/Senior_tasteey [link] [comments]
    ChatGPT, invent comics for robots.
    submitted by /u/Philipp [link] [comments]
    An A.I. video interpretation of "Metamorphosis Two" by Philip Glass
    submitted by /u/AnimalsChasingCars [link] [comments]
    I have a question
    What’s the best voice ai for song covers? Like I wanna do someone like Donald Trump, Cartman, Ice King/Simon singing The Boys (Eng Ver) by SNSD. Also it has to be free! submitted by /u/Ok-Upstairs-9887 [link] [comments]
    Apple and AI
    Apple has been behind in the AI field compared to companies like OpenAI, Google, Microsoft, and Amazon. While Apple has made improvements in autocorrect and AI features in Photos, it needs to catch up to remain competitive. Apple executives have been scrambling to make up for lost time and have been working on generative AI technology. There is anxiety within Apple about whether their AI/ML team can deliver. Source : https://daringfireball.net/2023/10/apple_and_ai submitted by /u/NuseAI [link] [comments]
    🚀 Gaming with ChatGPT using Encrypted Prompts and Prompt Injection! 🎮
    submitted by /u/Gloomy_Recognition_4 [link] [comments]
    How are neobanks utilizing AI to offer more accurate and personalized financial advice to customers?
    Your answers are appreciated. submitted by /u/Cygnet-Digital [link] [comments]
    One-Minute Daily AI News 10/23/2023
    The U.S. Senate will hold the second in a series of bipartisan AI Insight Forums on Tuesday, Oct. 24, where senators will hear from some of the most influential tech leaders to help inform regulations around the technology.[1] Microsoft announces A$5 billion investment in computing capacity and capability to help Australia seize the AI era.[2] Samsung is going all in with the AI performance of the Galaxy S24 phones.[3] Reddit has reportedly decided to block AI startups from scraping data from its website. This move prevents third-party companies from using Reddit’s data to train their machine-learning models without permission.[4] Sources: [1] https://news.asu.edu/20231020-government-calling-tech-leaders-help-crafting-artificial-intelligence-legislation [2] https://news.microsoft.com/en-au/features/microsoft-announces-a5-billion-investment-in-computing-capacity-and-capability-to-help-australia-seize-the-ai-era/ [3] https://www.androidheadlines.com/2023/10/samsung-galaxy-s24-smartest-ai-phone.html [4] https://www.androidheadlines.com/2023/10/reddit-block-ai-startups-scraping-data.html submitted by /u/Excellent-Target-847 [link] [comments]
    オレの攻撃からお前は逃れられぬ。 いかなる人間も、死という現実から決して逃れられぬように。 受け入れることだ。定めよ。
    submitted by /u/nicdunz [link] [comments]
    Anti deepfake headset V2
    You can find out more here in the comments submitted by /u/ahauss [link] [comments]
  • Open

    [D] How should I calculate the weights for a multi-label classification task where the labels are dependent among one another?
    I'm not sure if I worded the title correctly. Let me elaborate on the scenario. I have a multi-label image classification task where I'm trying to classify the gender of clothing images. The two labels that we can predict are Male and Female, hence the final logit vector's size would be something like [batch_size, 2]. Depending on the predictions, we're mapping the following binary values to different categorical values: [0, 0]: Unknown [0, 1]: Male [1, 0]: Female [1, 1]: Unisex The overall distribution is heavily imbalanced with Male being the minority class. I'm trying to calculate class weights to favor Male, but the problem is that the size of the weight tensors to be provided to the loss function should have a length of 2. I say this is a problem because although the number of prediction logits is 2, the actual number of classes is 4. I used the word "dependent" in my title because, for example, [1, 1] wouldn't necessarily mean that the image has the labels Male and Female, rather that it's a completely new Unisex label. Again, not sure if the usage of the word is appropriate. Anyway I've thought of making a custom loss function to first map the binary labels to their respective categorical values, but am wondering if there is any other way to go about this. submitted by /u/Seankala [link] [comments]  ( 9 min )
    [D] LSTM: Train & Val losses not converging
    I am training an LSTM model for path prediction where I'm feeding in OBT (on-board Time) and X matrix as input and Y matrix is the predecessor matrix generated using Scipy.Dijkstra ​ This is the model architecture for reference, This is the model architecture for reference, I've tried multiple iterations of this similar model, but the training and validation loss, are not converging. The best train_loss i've been able to achieve is 88k mse and 400 mse val_loss I've uploaded the dataset here: GitHub - mathur-exe/LSTM_Dataset Training Progress: Epoch 1/100 342/342 - 17s - loss: 22606898.0000 - val_loss: 61414736.0000 - 17s/epoch - 49ms/step Epoch 2/100 342/342 - 14s - loss: 7990657.0000 - val_loss: 3699703.5000 - 14s/epoch - 40ms/step Epoch 3/100 342/342 - 13s - loss: 4130298.7500 - val_loss: 136808.1094 - 13s/epoch - 38ms/step Epoch 4/100 342/342 - 12s - loss: 2747299.2500 - val_loss: 35710.1680 - 12s/epoch - 35ms/step Epoch 5/100 342/342 - 12s - loss: 2558378.2500 - val_loss: 3383.4780 - 12s/epoch - 36ms/step Epoch 6/100 342/342 - 13s - loss: 1214455.8750 - val_loss: 111625.2891 - 13s/epoch - 37ms/step Epoch 7/100 342/342 - 19s - loss: 337480.2500 - val_loss: 68686.6094 - 19s/epoch - 55ms/step Epoch 8/100 342/342 - 15s - loss: 316366.7188 - val_loss: 2059.3557 - 15s/epoch - 44ms/step Epoch 9/100 342/342 - 20s - loss: 293117.0312 - val_loss: 20961.5469 - 20s/epoch - 58ms/step Epoch 10/100 342/342 - 17s - loss: 575945.1875 - val_loss: 503602.8438 - 17s/epoch - 50ms/step Epoch 11/100 342/342 - 13s - loss: 290962.8750 - val_loss: 62491.9375 - 13s/epoch - 37ms/step Epoch 12/100 342/342 - 12s - loss: 1125042.5000 - val_loss: 36054.6836 - 12s/epoch - 36ms/step Epoch 13/100 ... 342/342 - 16s - loss: 230900.7969 - val_loss: 48309.6094 - 16s/epoch - 47ms/step Epoch 93/100 342/342 - 23s - loss: 232846.6094 - val_loss: 82926.6875 - 23s/epoch - 67ms/step submitted by /u/Gaurang_Mathur_ftw [link] [comments]  ( 9 min )
    [D] Will ChatGPT remove the need for data annotation?
    I wrote a blog post about this detailing my experience, which I will attach at the bottom but I want to hear opinions of people. It is something I've actively been thinking about, and would like to know potential pitfalls and why it may not work, rather than the huge promise it holds. https://ozanciga.wordpress.com/2023/10/24/will-chatgpt-remove-the-need-for-data-annotation/ submitted by /u/ozanciga [link] [comments]  ( 9 min )
    [R] Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
    submitted by /u/hzj5790 [link] [comments]  ( 9 min )
    [D] Is it better to create a different set of Doc2Vec embeddings for each group in my dataset, rather than generating embeddings for the entire dataset?
    I'm using Top2Vec with Doc2Vec embeddings to find topics in a dataset of ~4000 social media posts. This dataset has three groups: Posts from a company (3%) Posts from this company's potential customers (82.5%) Posts from this company's competitors (14%) The purpose of this analysis is to look at the topics the company is posting about on social media and see how it compares to the things that their customers and competitors are posting about. Since the values of Doc2Vec embeddings depend on the other documents in the dataset, I'm worried that topics in smaller groups are going to be drowned out by the larger group. I'm worried that the differences between the document vectors in the smaller group are going to be made smaller by presence of the documents from the larger group, which may represent a much wider array of different topics. submitted by /u/abelEngineer [link] [comments]  ( 9 min )
    [D] How would you do it? Handling multi-turn QA conversation with matching of questions to vector database.
    I have been giving this some thought and would appreciate some outside input, maybe someone has some experience they could share! I am attempting to create a QA chatbot that is limited to answering questions from a pre-determined set of question and answer pairs I have in a vector database. Currently I create embeddings of the question using OpenAI and query a vector database for similar "reference question" - if the similarity score is high enough I proceed and use the answer text I have stored in the metdata as "context" for the answer generation. I would now like to extend this to include conversational history. The issue I am facing however, is that a follow-on question may not hit the similarity threshold. Considering a follow-up question would typically not be worded in a way that …  ( 10 min )
    [P] The ML Practitioner, a publication about all things machine learning and MLOps
    Hi all, my wife and I have recently started a new publication called The ML Practioner. If you're interested in writing for us, please send us a link of your unpublished draft here. Either way, please subscribe to us if you're interested in this kind of content! submitted by /u/kanxx030 [link] [comments]  ( 9 min )
    [D] efficacy of cold start preferences on recs systems
    Hi all, Are there good papers about the efficacy of cold start explicit preference collection (think Netflix “pick some movies you like”) on the recs systems? I haven’t been able to find any so far. One key aspect I’m looking for is if these are effective, how long they are relative to just implicit actions the user takes. Thanks submitted by /u/steathilynecessary [link] [comments]  ( 9 min )
    [D] Embedding models ranked by encode speed?
    Hello, the sbert.net has a list where you can sort by encode speed but its a very small subset of the HuggingFace MTEB leaderboard. AFAICT, the HuggingFace leaderboard / model pages don't have this information. Is there a list where I can see a more up-to-date list of models by encoding speed? submitted by /u/rsamrat [link] [comments]  ( 9 min )
    [D] Finite State Transducers and language productivity
    In the context of NLP, will language models based on finite state transducers (since they are finite) ultimately fail to put language's productive nature to good use? All the possible outputs a finite state transducer can produce are predictable, while all the possible outputs a given natural language can produce are much less predictable? submitted by /u/RecordingOk5720 [link] [comments]  ( 9 min )
    [P] Equinox KV Cache
    I've been trying to implement a kv cache in my language model but have been unsuccessful so far due to the dynamic shapes. I've seen some implementations in flax but was wondering if it was possible to implement in equinox as that's what I'm using and prefer over others like flax. If anyone can point me in the right direction or help with the implementation that would be great! PS: I can provide any code if wanted to help submitted by /u/Additional-Ad-7043 [link] [comments]  ( 9 min )
    Explainable Boosting Machine Local and Global Explanation plots label size [D]
    I am using EBM for a research, the local and global explanation plots it produces come with preset font size, I want to change the resolution of the figure and the font size of labels and x and y ticks in the explanation plots. I have looked for it on the InterpretML github page and issues and scrolled through various webpages but haven't found anything helpful. Used gpt but it doesnot help either, it tries to use matplotlib but EBM plots are not compatible with it. Please share any way it can be solved, because the plots labels are unreadable in the article if used as it is. submitted by /u/Horseman099 [link] [comments]  ( 9 min )
    [D][P] What is the metric for early stopping in YOLOv8 detection?
    I am trying to fine tune the yolov8 detection model an was going through the code base of ultralytics.I found this piece of code in the engine.trainer # Early Stopping if RANK != -1: # if DDP training broadcast_list = [self.stop if RANK == 0 else None] dist.broadcast_object_list(broadcast_list, 0) # broadcast 'stop' to all ranks if RANK != 0: self.stop = broadcast_list[0] if self.stop: break # must break all DDP ranks I'm familiar with how the early stopping works and not sure what they are doing here does this get invoked by default?? what is the metric that they use in order to stop it?? upon further inspection i found this self.stopper, self.stop = EarlyStopping(patience=self.args.patience), False which is imported as from ultralytics.utils.torch_utils import (EarlyStopping, ModelEMA, de_parallel, init_seeds, one_cycle, select_device, strip_optimizer) please help me find out what metric they use to stop this and if the earlystopping is invoked by default submitted by /u/rakk109 [link] [comments]  ( 9 min )
    [P] The N Implementation Details of RLHF with PPO
    We are happy to share a great repro of OpenAI's early RLHF codebase, with nearly identical learning curves. We also summarized implementation details (did you know Adam Optim's implementation details could impact RLHF?) 📜 Blog post:https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo 💾 Code: https://github.com/vwxyzjn/lm-human-preference-details submitted by /u/vwxyzjn [link] [comments]  ( 9 min )
    ML [project] [p]
    What are best ways to collect database for any ml project submitted by /u/GingSkywalker [link] [comments]  ( 8 min )
    [R] Feature Space Reduction Method for Ultrahigh-Dimensional, Multiclass Data: RFMS
    We are excited to announce the publication of our groundbreaking scientific paper in Machine Learning: Science and Technology titled “Feature Space Reduction Method for Ultrahigh-Dimensional, Multiclass Data: Random Forest-Based Multiround Screening (RFMS)” by Gergely Hanczar, Marcell Stippinger, David Hanak, Marcell T Kurbucz, Oliver M Torteli, Agnes Chripko, and Zoltan Somogyvari. Published on: 19 October 2023 DOI: 10.1088/2632-2153/ad020e Volume 4, Number 4 In recent years, several screening methods have been published for ultrahigh-dimensional data that contain hundreds of thousands of features, many of which are irrelevant or redundant. However, most of these methods cannot handle data with thousands of classes. Prediction models built to authenticate users based on multichannel biometric data result in this type of problem. In this study, we present a novel method known as random forest-based multiround screening (RFMS) that can be effectively applied under such circumstances. The proposed algorithm divides the feature space into small subsets and executes a series of partial model builds. These partial models are used to implement tournament-based sorting and the selection of features based on their importance. This algorithm successfully filters irrelevant features and discovers binary and higher-order feature interactions. To benchmark RFMS, a synthetic biometric feature space generator known as BiometricBlender is employed. Based on the results, the RFMS is on par with industry-standard feature screening methods while possessing many advantages. r/IAMA - Oct 26 with the founders of Cursor Insight. https://bit.ly/AMAwithCursorInsight-GoogleCalendar ​ R/IAMA - Oct 26 with the founders of Cursor Insight. submitted by /u/CursorInsight [link] [comments]  ( 9 min )
    [N] New letter from Yoshua Bengio, Geoffrey Hinton, and others: Managing AI Risks in an Era of Rapid Progress
    Signatories include Turing Award winners Yoshua Bengio, Geoffrey Hinton, as well as others academics and experts. In 2019, GPT-2 could not reliably count to ten. Only four years later, deep learning systems can write software, generate photorealistic scenes on demand, advise on intellectual topics, and combine language and image processing to steer robots. As AI developers scale these systems, unforeseen abilities and behaviors emerge spontaneously without explicit programming1. Progress in AI has been swift and, to many, surprising. The pace of progress may surprise us again. Current deep learning systems still lack important capabilities and we do not know how long it will take to develop them. However, companies are engaged in a race to create generalist AI systems that match or ex…  ( 10 min )
    [D] Generative Food
    Hey guys, I sometimes post about tiny ML projects we work on. This time, we talk about applying language models for generating recipe titles/ideas. Specifically, we don't use LLMs, and this turns out to be a bit of a controversial decision, but one that has it's own advantages. Quite interested in the community's take on it: https://engineering.hellofresh.com/recipes-and-generative-ai-6d74a107860c submitted by /u/abnormdist [link] [comments]  ( 9 min )
    [R] Tokenizer Choice For LLM Training: Negligible or Crucial?
    📷Research https://arxiv.org/abs/2310.08754 While the recent success of LLMs has been driven primarily by curation of training dataset composition, scaling of model architectures and dataset sizes, and advances in pretraining features, the impact of tokenizers has often lagged as a blind spot. Our researcher*s study sheds light on this issue and shows that tokenizer choice can significantly impact downstream model performance as well as training and inference costs. 1️⃣ Investigation of intrinsic tokenizer performance, i.e., study of tokenizer properties (i.e., generated vocabulary), and tokenization results of tokenizers. 2️⃣ Investigate the extrinsic performance of the tokenizer, i.e., the impact of the tokenizer on the downstream performance of the model. 3️⃣ Investigation of possible correlation between intrinsic and extrinsic tokenizer performance. ​ 💡 The investigation shows that the common tokenizer evaluation metrics "fertility" and "parity" do not always predict the performance of the downstream model, making these metrics a questionable criterion for tokenizer evaluation. 💡 Moreover, the study shows that multilingual tokenizers - which are based on the five most common European languages - require a vocabulary size by a factor of three compared to English. The previous approach of training tokenizers with English vocabulary only thus turns out to be inefficient and results in a strong performance degradation and additional training costs of up to 68% submitted by /u/effi28_ml [link] [comments]  ( 9 min )
    [R] Using Machine Learning to Drive Portfolio Asset Allocations
    I'd love to hear your guys thoughts on next steps to improve this, maybe deeper layers and more nodes, maybe a random forest is more appropriate? I'd love to hear any thoughts on Machine Learning directly applicable to time-series data. https://www.quantitativefinancialadvisory.com/post/asset-allocation-in-a-post-modern-portfolio-theory-world-part-1-the-single-layer-taarp-ml-model The Main Idea We will develop a Machine Learning model, specifically a deep learning model (more hidden layers to come), to periodically, tactically rebalance the weights of our portfolio based on observable market data and empirically determined statistics combined with feature engineering from the past 21 trading days, and for the VIX we consider its characteristics since inception. The output will be a range representing the degree to which we bet long, short, or hold cash, and 3 weights that sum to less than or equal to one and greater than or equal to negative one. In essence we will allow shorting of securities and not require our portfolio to be fully invested. Cash is an active position; sometimes the best investment is staying on the sidelines. The model will allow one input layer, one and two hidden layers (to show that more might not always be better, explicitly with the 200 variable maximum excel solver imposes on us), and an output layer with 3 nodes outputting a value between -1 and +1 with -1 representing a full allocation to a short position in the security and +1 representing a fully allocated long position. submitted by /u/QFA_official [link] [comments]  ( 9 min )
    [D] Are people in ML Phds still happy?
    As an outsider who has many friends in ML Phds, this is my perspective of their lives: long hours, working nights, weekends no work-life balance, constant fear of being scooped and time pressure from deadlines frustrating broken review systems many incremental, advertisement papers that produce very little actual contribution (which is justified by 2.) "engineering" and not "science" all this pressure amounts to severe imposter syndrome Are people in the field still happy? Where do people get their satisfaction? To me it looks like almost like a religion or a cult. The select few who say, get neurips outstanding paper are promoted to stardom - almost a celebrity status while everyone else suffers a punishing work cycle. Are the phd students all banking on AGI? What else motivates them? Edit: the discussion is about whether 1-6 are worse in ML than other fields (or even the median experience). The reference for "other field" is highly heterogenous. Experience obviously varies by lab, and then even by individuals within labs. "It happens in other fields too" is a trivial statement - of course some version of 1-6 affects somebody in another field. Edit 2: small n but summarizing the comments - experience seems to differ based on geographic region, one's expectations for the phd, ability to exert work-life balance, and to some extent ignore the trends others are all following. Some people have resonated with problems 1-6, yet others have presented their own, anecdotal solutions. I recommend reading comments from those who claim to have solutions. submitted by /u/shenkev [link] [comments]  ( 9 min )
    [P] A PDF tool that supports three retrieval strategies, allowing users to choose the answer that suits them best
    ➡️ Check on https://huggingface.co/spaces/xuyingliKepler/VecDBCompare 📌 Introduction: VecDBCompare is a streamlit-based application designed to evaluate and compare three different vector database retrieval strategies. Users only need to upload a PDF and interact with QABots using three different strategies to determine which strategy is most suitable for them. ⭐️ Three retrieval strategies: Chunk Strategy: Divides the document into small chunks and retrieves based on the most relevant chunks. Summary Strategy: Summarizes the document and retrieves based on the summary content. Hypothetical Question Strategy: Generates hypothetical questions that the document might answer and retrieves based on these questions. submitted by /u/xuying_li [link] [comments]
    [D] [P] 3D Design file labelling and classification for manufacturing
    I have ~1 million 3D design (.STP and/or .OBJ) files of various parts for medical devices, aerospace, automotive or defense systems. I'd like to label them based on appropriate manufacturing methods that are used to physically make them. Some example methods and labels would be milling, turning, injection molding, cnc machining, etc. After labelling, I'd like to architect a system to produce these labels as inference for a new part that has not been physically made yet. My team (<5 people) have manufacturing domain expertise and can manually label these parts but I'm looking for a more scalable solution that isn't as time consuming. Crowd sourced methods like Mechanical Turk won't work because annotators do not have the domain knowledge to mark the correct label. Labelling platforms like SageMaker/Azure ML Studio only allow image/text/audio datasets, is there a platform that'll help me setup labelling tasks for 3D designs? Furthermore, how can I find more experts that can help scale this up? It seems to me that the only option is to build my own labelling app as an annotator needs these key features - 3D model visualizer so they can spin the part and view any orientation Draw a bounding box (commonly available in other platforms) Toggle measurements in inches/mm As for label classification I'm looking at architectures like PointNet since my dataset of meshes can be sampled to point clouds. Are there other methods that would work better or worth exploring? Open to any and all suggestions across this pipeline. ​ ​ submitted by /u/rootcage [link] [comments]  ( 9 min )
    [D] Undergrad seeking advice on ethics/ML research
    I’m an undergraduate who’s considering a PhD student in ML. I’m currently in a lab that focuses on ethics in AI. While I love the work, it focuses on the humanities side of CS. I’ve always been a more mathy person and have always been interested in theoretical ML research. I’d like to combine ethics & AI/ML in some way (eg studying explainable AI from the technical perspective). I was wondering what are some research areas that combine the two and if I don’t work in academia, what’s the market and job prospects like for someone who does this? submitted by /u/SnooChipmunks1902 [link] [comments]  ( 9 min )
  • Open

    Rewards in Montezuma's Revenge
    Hello all, I'm working on Montezuma's Revenge using the Gymnasium API. I wonder if there's anyone here that knows the numerical value of the rewards? And if so, how they are typically scaled down. ​ Thanks! ​ G_bes submitted by /u/G_bes [link] [comments]
    The N Implementation Details of RLHF with PPO
    submitted by /u/vwxyzjn [link] [comments]
    Creating a Custom Environment in Unreal Engine 5
    Hello, I would like to create my own environment (Maze), in which I would like to train my drone using reinforcement learning, I am kind of new and I don't know how can I set the state space, rewards, and if I would like to use BS3 for training then how can I connect the environment? And for the agent which is the drone, should i just do the AirSim build.cmd and take the agent from there and place the starting position flag or what? I am a bit lost and I can't find tutorials on how to do this, I'd appreciate it if you could provide some guidance. Thanks in advance. submitted by /u/Gabii99 [link] [comments]
  • Open

    Intelligent document processing with Amazon Textract, Amazon Bedrock, and LangChain
    In today’s information age, the vast volumes of data housed in countless documents present both a challenge and an opportunity for businesses. Traditional document processing methods often fall short in efficiency and accuracy, leaving room for innovation, cost-efficiency, and optimizations. Document processing has witnessed significant advancements with the advent of Intelligent Document Processing (IDP). With […]  ( 20 min )
    T-Mobile US, Inc. uses artificial intelligence through Amazon Transcribe and Amazon Translate to deliver voicemail in the language of their customers’ choice
    This post is co-authored by Dhurjati Brahma, Senior Systems Architect at T-Mobile US, Inc and Jim Chao, Principal Engineer/Architect at T-Mobile US, Inc and Nicholas Zellerhoff Associate Systems Architect at T-Mobile US, Inc. T-Mobile US, Inc. provides a Voicemail to Text service to its customers, which allows customers to quickly read through their voicemails and […]  ( 7 min )
  • Open

    DSC Weekly 24 October 2023
    Announcements Top Stories In-Depth The post DSC Weekly 24 October 2023 appeared first on Data Science Central.  ( 21 min )
    Seamless integration of data from unconventional source systems into Business Intelligence using data science techniques
    Written by Venkata Nori and Kshitij Gopali. Introduction As technology is evolving, most companies in the world are adopting advanced mechanisms for their daily tasks of storing/updating data, project management & tracking, incident management, version control, etc. Periodically, these companies’ business stakeholders would want to extract and analyze the data to see how the business… Read More »Seamless integration of data from unconventional source systems into Business Intelligence using data science techniques The post Seamless integration of data from unconventional source systems into Business Intelligence using data science techniques appeared first on Data Science Central.  ( 25 min )
    How data science and medical device cybersecurity cross paths to protect patients and enhance healthcare
    A recent interview by Medical Device Network with GlobalData medical analyst Alexandra Murdoch shares interesting insights into cybersecurity for medical devices. The post How data science and medical device cybersecurity cross paths to protect patients and enhance healthcare appeared first on Data Science Central.  ( 22 min )
    Skills required to excel in a business analytics career
    In the contemporary business landscape, where data is heralded as the new oil, Business Analytics has emerged as a pivotal domain, steering organizations towards informed decision-making and strategic planning. business analytics encompasses the utilization of data, statistical algorithms, and machine learning techniques to comprehend the business context, forecast future trends, and facilitate optimal decision-making. The… Read More »Skills required to excel in a business analytics career The post Skills required to excel in a business analytics career appeared first on Data Science Central.  ( 22 min )
    GenAI: The game-changer in data analytics
    In an era where data drives decisions, GenAI emerges as a prodigy force in the realm of data analytics. According to Statista, LLM’s market size is expected to show an annual growth rate of 24%, resulting in a market volume of $207 bn by the end of 2030.  This cutting-edge technology, built on sophisticated algorithms… Read More »GenAI: The game-changer in data analytics The post GenAI: The game-changer in data analytics appeared first on Data Science Central.  ( 22 min )
  • Open

    Animated AI
    submitted by /u/nickb [link] [comments]
  • Open

    Best of N series
    A couple days ago I wrote about the likelihood of the better team winning a best-of-five or best-of-seven series. That is, if the probability of X winning a game against Y is p > ½, how likely is it that X will win a majority of 5 games or a majority of 7 games. This […] Best of N series first appeared on John D. Cook.  ( 6 min )
    Lessons from Skylab
    I discovered the Space Rocket History Podcast a while back and listened to all the episodes on the Apollo program. I’m now listening to the episodes on Skylab as they come out. I came for Apollo; I stayed for Skylab. I would not have sought out the episodes on Skylab, and that would have been […] Lessons from Skylab first appeared on John D. Cook.  ( 6 min )
    Curvature: principal, Gauss, and mean
    This post will compute the center of curvature for an object described in the previous post. In order to do that we first need to describe principle curvature and Gauss curvature, and we’ll throw in mean curvature while we’re at it. Let S be a surface sitting in three dimensional space. No need for more […] Curvature: principal, Gauss, and mean first appeared on John D. Cook.  ( 6 min )
    An algebraic superegg
    One year ago I wrote about a variant of the squircle that is quantitatively close to the customary definition but that has nicer algebraic properties. That post used the term p-squircle for the usual squircle with equation where p > 2, and the term s-squircle for the variation with equation where s is between 0 […] An algebraic superegg first appeared on John D. Cook.  ( 5 min )
  • Open

    On Razer’s Edge: VFX Star Surfaced Studio Creates Stunning Sci-Fi World This Week ‘In The NVIDIA Studio’
    Visual effects artist Surfaced Studio returns to 'In the NVIDIA Studio' to share his real-world VFX project, created on a brand new Razer Blade 16 Mercury Edition laptop powered by GeForce RTX 4080 graphics.  ( 8 min )

  • Open

    [P] Traffic signs in ecognition developer
    Hi community, first time posting here. I'm working on a project for the segmentation and classification of traffic signs using eCognition Developer software. I need help with creating scripts to apply three classifiers: Naive Bayes, SVM, and Random Forest. I'd like to know how I can implement these classifiers in eCognition Developer and where to insert the scripts in the software. Does anyone have experience with this software and could share script examples or provide guidance on how to accomplish this task? Sorry English is not my first language. Tldr, i need to include the Bayes classifiers, Random Tree, and SVM in eCognition Developer (for segmentation and classification - prediction). submitted by /u/Dignai [link] [comments]  ( 9 min )
    [P] Using gpt4docstrings to generate docstrings for entire projects
    gpt4docstrings is a Python library that allows you to write docstrings for functions / classes non documented in your codebase. In this case, I'm applying the library to one module of langchain to see the results. Repo: https://github.com/MichaelisTrofficus/gpt4docstrings https://i.redd.it/78f3wit071wb1.gif submitted by /u/Hefty-Consequence443 [link] [comments]  ( 9 min )
    [P] DQN with a binary vector as output
    Heey everyone! I hope you're doing well. I need your help guys. I'm working on a DQN that outputs a binary vector of length L (I just applied sigmoid function on the ouptut layer and take p>0.5 as 1 and 0 otherwise). In this setting, at each decision time, the agent returns a list containing the indices of selected elements. Knowing that the list's length is dynamic how can I train my DQN ? (I am facing issues in this). Is there any alternative way to do this purpose (like DDPG :/ )? submitted by /u/GuavaAgreeable208 [link] [comments]  ( 9 min )
    [Project] Looking for AI/ML engineers to team up for a fallow deer identification project
    Hi, first of all, sorry for the cross post, but I guess Huggingface forums were not the right place to begin with and it took me a while to find out where things about AI/ML are being actively discussed. I am a professional software developer (C, Python on Linux) and while I did try out a few things with PyTorch and Diffusers - I am not an ML engineer, so I am looking for someone with ML expertise who’d be interested to team up for a non commercial open source project. I can do quite a lot around application development, but I clearly lack the required ML knowledge. I followed the free MIT ML courses on YouTube, did some reading, tried things out, but the ML part of this project is for sure over my head. So, here’s what I have in mind: I would like to create an application which would b…  ( 11 min )
    [D] Using SQL to monitor ML models
    Hello, We are running a number of machine learning models in production and would like to monitor some metrics during inference: Data quality, inference time, accuracy, etc. All these metrics could be recorded in the python code and we are planning to build a SQL database that will receive all the information so as we can visualize in grafana. Do you think this is a good pattern? What would you suggest instead (we are using AWS). Thank you in advance. submitted by /u/Eddas123 [link] [comments]  ( 9 min )
    [R][P] Trying to understand the generative properties of autoencoders
    A while back, I came across the "From Variational to Deterministic Autoencoders", which provided a novel insight into the generative properties of autoencoders by framing the objective through the lens of regularization. However, I couldn't help but notice that the deterministic models studied felt incomplete, namely due to the inherent lack of sampling in those models (which is something that the authors acknowledge). To provide a short recap of the paper, the authors surgically decompose the variational autoencoder objective into a deterministic one. They start with a Constant-Variance VAE, which is a special case of the general Gaussian latent VAE where the noise standard deviation of the latent distribution is fixed to 1. This leads to what is essentially a standard autoencoder with t…  ( 10 min )
    [D] What is the lowest possible loss for a language model?
    Example: Suppose a character-level language model (three input letters to predict the next one), trained on a dataset that contains three instances of the sequence aei, with two occurrences preceding o and one preceding u, i.e., the dataset is: Input Output aei o aei u aei o In this case, the ideal probability distribution for the model's logits for aei would be ~0.66 for o, ~0.33 for u, and zero for other letters. In other words, when the model is input with aei, the ideal softmax of the logits would be ~0.66 for o, ~0.33 for u, and zero for other letters. Following this reasoning, the objective is to optimize the model's output for a given input to match the distribution of occurrences in the dataset. If this reasoning is correct, then we have the following ideal loss (cross-entropy): https://preview.redd.it/pzpxogcqd0wb1.png?width=330&format=png&auto=webp&s=b0b6c3b5fbfb4797c11a1f26375065ce883551d3 Thus, ~0.63 is the smallest loss we can get with this dataset. Is my reasoning correct? submitted by /u/viniciusarruda [link] [comments]  ( 9 min )
    [D] Tanh activation function outputs the same value for any given input
    Basically im working on the DDPG algorithm in DRL where i have an actor and critic networks. The actor network architecture is quite simple: Input layer contains 22 neurons that represents the state values (ranging from 0.1 to 10.0 max not normalizing them) Two hidden layers with 128 neurons, with Leaky Relu activation (alpha = 0.01), and with HeUniform kernel initialzer Output layer with a single neuron has tanh activation, using Glorot kernel initialzer The critic network has the same architecture but we only concatenate the 22 state values with the action produced by the actor, the only difference is the ouput of the critic has no activation. And both networks use Adam. The problem arises when the training starts because i run a few steps without actually start the learning, but when the learning starts, the actor converges quickly to output values 1 or -1 afor any given input. I tried many learning rates for both actor and critic. One thing to note is when i set the actor learning rate to 1e-5 and the critic to 1e-3 the networks sometimes converges quickly, some time it takes longer to converge and sometimes it does not converge. submitted by /u/Desert_champion [link] [comments]
    [P] Fine-tuning VAEs on limited data
    I have been looking for a pre-trained VAE (on Imagenet with ResNet/VGG) or similar which I could fine-tune on my smaller dataset. However, not only there does not exist many such pre-trained weights but the practice of fine-tuning VAEs does not really seem mainstream. Is there a reason why VAEs are not pre-trained/fine-tuned? Does it have to do with posterior collapse? submitted by /u/unholy_sanchit [link] [comments]  ( 9 min )
    [D] Smart pooling for Visual Transformers
    There is an architecture for images/videos called MViT, where 2D MaxPooling layers are added to reduce computations for ViT. But MaxPooling has a drawback - it discards information independently of context, equally discarding information from both important and uninformative parts of the image. For traditional Conv2D networks, there's little we can do about this, but for transformers, we can reduce dimensionality in a more meaningful way - discarding only those elements that don't carry unique information. Are there any articles/developments on this topic already? submitted by /u/Dependent_Bluejay_45 [link] [comments]  ( 9 min )
    "[Research]" RVC AI Training
    Hello, I'm currently using RVC AI, and I'm about to record myself for the training. What is the best way to record myself except the singing and talking at least for 15 minutes like the guide says. Do I have to make it 20 min and one audio file or do I have to make it 20 min and maybe 10 files with 2 minutes each file? Also, can I multiply my files and reach the 15-20 minutes of audio that it's required or I have to make a different talking or singing for every audio? submitted by /u/WeldFrenzy [link] [comments]  ( 9 min )
    [D] RAG oriented fine-tune... Searching for coherence
    Still searching for a model that is well enough to make RAG... Lots of good models on huggingface, but none of them is trained to return extracted text or answers based on provided info without hallucinating something. Is quite frustrating, every week came a new version of a model that is amazing for Role play and storytelling... (some good progress also on coding...) I see lots of efforts in different RAG strategy, improving semantic search and Chunking, but the open source community still does not have a decent model fine tuned for that. I have considered the idea of make that fine tune, based on synthetic data (using Wikipedia as knowledge base), but unfortunately I have not enough funds to cover the api cost neighter to pay for some decent Gpu. I'm not going to train a 7B Model because the under 30B imho doesn't have many sense if the coherence is the main requirements. Unpopular opinion: as coherence, code llama 34B is much better to any of the 70B fine tune. Sorry to everyone for the rant... Does anyone have some tips or suggestions? Thanks in advance! Edit: My database is composed mainly by abstracts of papers and medical textbook. I admit that the domain is quite complex, but the error rate is too high. Obviously that even if prompted to avoid that (tried and refined multiple prompts, using different prompt format). Gpt3.5, Claude instant and Palm2-Bizon work fine for that task. (obviously GPT4 and Claude 2 would be best, but too expensive for me) I spent lots of time to make a solid embedding pipeline: advanced chunking, Metadata added by llm, text for similarity search different from text provided to LLM, instructor bi encoder to generate embeddings(INSTRUCTOR-XL), reranking using cross encoder, RAG-Fusion using multiple query and HyDE approach Hybrid search with BM25 So... I'm a bit frustrated that i can not run all locally, became that is a must for my project. submitted by /u/Distinct-Target7503 [link] [comments]  ( 10 min )
    [R] 2x the context length of ALiBi through position interpolation
    https://arxiv.org/abs/2310.13017# Linear position interpolation helps pre-trained models using rotary position embeddings (RoPE) to extrapolate to longer sequence lengths. We propose using linear position interpolation to extend the extrapolation range of models using Attention with Linear Biases (ALiBi). We find position interpolation significantly improves extrapolation capability on upstream language modelling and downstream summarization and retrieval tasks. submitted by /u/jwan584 [link] [comments]  ( 9 min )
    [D] How to make research publication more reproducible?
    As context, I'm personally working on a project to make ML/AI research publication more reproducible. We're backed by Balaji Srinivasan (https://twitter.com/balajis) at the level of funding and advice. It seems like, despite attempts like Jupyter Notebooks or sites like Papers with Code, most published research in ML still isn't setup to be easily reproducible. Even companies like Anthropic/OpenAI don't put much of an emphasis on reproducibility, even though it's in their interest to do so to earn public trust. Our current hypothesis is to conceptualize reproducible research as software testing. Specifically we're thinking of building tools that let you internally test the robustness of results, and externally publish them s.t. they're reproducible. You can think of it as continuous integration for reproducible research; e.g. BuildBot for Reproducible Research. One specific idea I have is to build a model evaluation/testing platform that lets you: Internally eval LLM models on open benchmarks (TruthfulQA, AGIEval, etc.) Test robustness of results under different assumptions Externally publish reproducible results I don't have a background in ML research. So I'm looking to get input from research engineers on what challenges/barriers currently exist with model testing and publishing reproducibly — so I thought I'd reach out in this community if anyone's open to that! Let me know if this post doesn't conform to the rules, or if this should go somewhere else. submitted by /u/manveerbasra [link] [comments]  ( 9 min )
    [P] Image Captioning Model
    Hello everyone, I am currently trying to find suitable image captioning and visual question answering models to implement in my project. After a quick google search I came across BLIP2 from hugging face however, its a very large model overall and both my pc and colab could never load its lightest pretrained version. Does anyone know any similar pretrained models for the specific tasks or any other way to load this kind of large model? (I tried loading it with 8bit precision which still failed) I have 16gb of RAM and the task requires image captioning and the ability to ask the model details about the specific image. Any help is greatly appreciated!! submitted by /u/Spitefulsalamander [link] [comments]  ( 9 min )
    [D] Episodic Training vs. Random Sub-Sampling in Few-Shot Learning
    I'm new to few-shot learning and I'm having trouble understanding why prototypical networks use a random sub-sampling approach while the vanilla few-shot learning approach uses episodic training. Doesn't random sub-sampling fail to guarantee that data overlapping won't occur? submitted by /u/The_Aoki_Taki [link] [comments]  ( 9 min )
    [D] High-temperature softmax
    I implemented a label propagation algorithm which is mainly used in the field of Video Object Segmentation (VOS). Basically I provide the labels for one frame and ask my model (using pre-trained encodings of frames) to do semantic segmentation on all the other frames of a video. I am obtaining consistently better results using an high temperature softmax when computing the similarity between pixels of different frames. Then the top-k similarities of each pixel (features) are used to propagate the labels from one frame to the next. I will not disclose the dataset I am using but let's say it is noisy (let's say also low quality). I want to understand why an high-temperature softmax performs better than a softmax with T=1 or an extreme T = 0.01. At the moment I get better results with T = 10, 100 and the trend in my grid search shows that even higher T could be possible. I was wondering if the model is still considerable valid if T is too high. I feel like the model is almost randomly guessing, if T is too high, but this apparently enhances performance. Every help is appreciated. Also literature about the topic! I only found one paper (which uses an high-temperature softmax to distill knowledge in a student-teacher network for remote sensing imagery) submitted by /u/darthjeio [link] [comments]  ( 9 min )
    [D] Callbacks in tensorflow v1
    Hi everyone, I have some old code written in tf1. It has not been ported to tf2 or pytorch yet. Does anyone of you have leads on whether one can implement custom callback for tf1 code and if there are any examples on the web? Thanks in advance. submitted by /u/wrik003 [link] [comments]  ( 9 min )
    [N] CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages
    In the tech report of GPT4, an analysis was conducted on the impact of different languages on model performance. These effects are attributed to the amount of data and language characteristics. This also indicates that the model's effectiveness may not meet the expectations of users in different languages. The problem addressed in this paper is of significant importance. https://preview.redd.it/s48419fe9yvb1.jpg?width=2748&format=pjpg&auto=webp&s=ba76f1bd18043c6cb2610ed90f5c41a78b5ccd95 Arxiv: https://arxiv.org/abs/2310.13683v1 Stay updated with AI in a fun-to-listen way. Check out ai-dailynews.com to generate your personalized news podcast🎙. It's one of my open-source projects and takes no charge. submitted by /u/xuying_li [link] [comments]  ( 9 min )
    [N] Neural-Base Music Generation for Intelligence Duplication
    The paper employs a deep learning system to learn from the great composer Beethoven and capture his composition ability in a hash-based knowledge base. This new form of knowledge base provides a reasoning facility to drive the music composition through a novel music generation method. https://preview.redd.it/l9gzcoe38yvb1.png?width=1944&format=png&auto=webp&s=d6c5ca7f8fe434be1187c1f0440c5a94ebfc9b64 Arxiv: https://arxiv.org/abs/2310.13691v1 For more AI updates, check out this AI-generated news podcast🎙 tailored to your preferences(ai-dailynews.com), which is open source and free. submitted by /u/xuying_li [link] [comments]  ( 9 min )
    [D] Biclustering with the same row and column clusters
    The biclustering algorithm partitions rows and columns of a matrix into clusters so that the variance inside each intersection between row and column clusters in minimized. I want to perform the biclustering of a matrix, but additionally to enforce that the row and column clusters are the same, i.e. if the row i lies inside a row-cluster c then the column i must lie in a column-cluster c. Rows and columns in the matrix represent the same entities (but the matrix is non-simmetric). sklearn implementation does not support such a constraint. Are there any algorithms for this at all? submitted by /u/Tomarchelone [link] [comments]  ( 9 min )
    [D] Referenceless NLP Evaluation
    Hey all, I'm building this open source project that helps ML engineers evaluate LLM applications (its like unit testing for LLMs), and it works great in development since users can just write a test_file.py like how you would normally do it in pytest, but as I'm going onto the next phase I'm thinking how to bring evaluation to production, especially on metrics such as factual consistency where I need a ground truth. I'm hoping to get some ideas around this. Here's a link to the repo (https://github.com/confident-ai/deepeval) if you want more clarity on what the package looks like, but most importantly any help to brainstorm production evaluation will be greatly appreciated. Thank you very very much! submitted by /u/Ok_Constant_9886 [link] [comments]  ( 9 min )
    [D] Is Computer Vision dead? - “Quo Vadis, Computer Vision?”
    In ICCV23, several top notch researchers shared their insights (in a workshop called “Quo Vadis, Computer Vision?”) wrt the current state of Computer Vision, especially in light of the meteoric raise of LLMs. Has CV stalled? Is CV dead? E.g.MIT’s professor Bill Freeman, has some interesting points on foundation models: “FM aren’t fundamental, therefore not stable". Jitendra Malik argues "video can describe the world better than text." submitted by /u/btcmx [link] [comments]  ( 9 min )
    [R] Biologically plausible vision models for classification and grasping tasks
    Hey everyone! I am looking for papers that propose or explore biologically plausible vision models, primarily tasks like classification and grasping (predicting grasping bounding boxes) tasks. By biologically plausible, I mean papers that propose models inspired by the human brain in some way or the other. I know convolution is loosely inspired by human cognition, but everything I can find seems to suggest the opposite for ViT like models. I have come across certain papers like these: - https://arxiv.org/abs/1901.00945 - https://proceedings.neurips.cc/paper/2020/hash/98b17f068d5d9b7668e19fb8ae470841-Abstract.html But I am still looking for more. Any suggestions? submitted by /u/Far_Clothes_5054 [link] [comments]  ( 9 min )
    [D] Understanding the math behind diffusion models
    I was trying to comprehend the math behind this paper: https://arxiv.org/pdf/2006.11239.pdf. You can see in the equation corresponding to the forward diffusion process, at each time step, the image in the previous step is also scaled by sqrt(1-beta_t) while adding noise. It seems like the purpose of this is to maintain a fixed variance (or specifically, unit variance) at each time step. My question is: What is the significance of maintaining unit variance at each time step? Why is this useful? I saw somewhere that this is done to prevent the variance from "exploding." I don't really know what this means. I guess the variance keeps on increasing if the scaling isn't done. But why is this bad? submitted by /u/fallendeviL701b [link] [comments]  ( 9 min )
    [D] Neural Attention - One simple example that explains everything you need to know
    submitted by /u/AvvYaa [link] [comments]  ( 9 min )
    [D] Has anyone tried deploying FastAPI v2 with a BERT model on the NVIDIA Triton Inference Server?
    I'm not sure how to enable BERT with flash attention during the start-up of the Triton server in order to accelerate inference. Dao(the author of FA) told me he’s never tried. submitted by /u/g14loops [link] [comments]  ( 9 min )
  • Open

    Etsy Taking Stores Down as it's Bot Can't Tell Which Mockups are Real and Which ones are AI Generated
    If you are an Etsy seller or know someone who sells on Etsy, or maybe you went on Etsy and your favorite store is gone, could be due to the Etsy bots taking down stores for not figuring out properly which Mockup Images are real and which ones are AI Generated. All you have to do to find this out is go on youtube or social media and look for "etsy mockups news". Also Etsy has been pretty quiet about this and as a result Etsy sellers are going crazy about this as no one knows why some stores who haven't used AI to create their mockups are being targeted by these bots. This just goes to show how hard is getting to distinguish between what is real and what is AI generated and how across all industries companies are having issues adapting to AI technology changes. Thoughts? submitted by /u/fk1220 [link] [comments]
    New data poisoning tool lets artists fight back against generative AI
    Nightshade is a new data poisoning tool that allows artists to fight back against generative AI models. By adding invisible changes to the pixels in their art, artists can cause chaos and unpredictable results in AI models that use their work without permission. The tool, called Nightshade, is intended as a way to fight back against AI companies that use artists’ work to train their models without the creator’s permission. Using it to “poison” this training data could damage future iterations of image-generating AI models, such as DALL-E, Midjourney, and Stable Diffusion, by rendering some of their outputs useless—dogs become cats, cars become cows, and so forth. AI companies such as OpenAI, Meta, Google, and Stability AI are facing a slew of lawsuits from artists who claim that th…
    I would like to upload 100+ one-hour-long podcasts in MP3 and get a 1-page summary of the most important points discussed in each episode — what's the best way to go about doing this?
    ChatGPT and Bard are cool, but I have to manually feed them transcripts generated by Whisper to get summaries. Furthermore, since the length of the transcript is often longer than the maximum character limit(s), I have to add additional prompts in between copying and pasting multipart transcripts. Since these recordings are 10–15 years old, the audio quality isn't the best, but I think it's sufficient to generate transcripts + detect speech, if not, I might need an additional "audio cleaning" step as well. I don't mind paying, and I'm above average in technical ability, so if anyone has any suggestions, I'd love to hear them. Here's what the workflow would look like: INPUT: I will upload a folder containing 100+ MP3 files of podcasts with below-average audio quality. OUTPUT: I would like to get a Google Doc or a Text file with 1-page summaries of the most important points in bullet-point format corresponding to each episode. Each page should be separated by some sort of divider, and the header should contain the filename for reference. Ideally, there should be an existing Jupyter Notebook I could throw in Google Colab and do all of the above in a plug-and-play manner, but if not, I'd love to hear your thoughts. Any tips? Thanks! submitted by /u/aknalid [link] [comments]
    The dilemma of potential AI consciousness isn't going away - in fact, it's right upon us. And we're nowhere near prepared. (MIT Tech Review)
    https://www.technologyreview.com/2023/10/16/1081149/ai-consciousness-conundrum/ "AI consciousness isn’t just a devilishly tricky intellectual puzzle; it’s a morally weighty problem with potentially dire consequences. Fail to identify a conscious AI, and you might unintentionally subjugate, or even torture, a being whose interests ought to matter. Mistake an unconscious AI for a conscious one, and you risk compromising human safety and happiness for the sake of an unthinking, unfeeling hunk of silicon and code. Both mistakes are easy to make." "Every expert has a preferred theory of consciousness, but none treats it as ideology—all of them are eternally alert to the possibility that they have backed the wrong horse." "The trouble with consciousness-­by-committee, though, is that this state of affairs won’t last. According to the authors of the white paper, there are no major technological hurdles in the way of building AI systems that score highly on their consciousness report card. Soon enough, we’ll be dealing with a question straight out of science fiction: What should one do with a potentially conscious machine?" "For his part, Schwitzgebel would rather we steer far clear of the gray zone entirely. But given the magnitude of the uncertainties involved, he admits that this hope is likely unrealistic—especially if conscious AI ends up being profitable. And once we’re in the gray zone—once we need to take seriously the interests of debatably conscious beings—we’ll be navigating even more difficult terrain, contending with moral problems of unprecedented complexity without a clear road map for how to solve them." submitted by /u/kamari2038 [link] [comments]
    The Future of AI Voice Technology
    submitted by /u/Amandacerni [link] [comments]
    UK officials use AI to decide on issues from benefits to marriage licences
    submitted by /u/sky_badger [link] [comments]
    One-Minute Daily AI News 10/22/2023
    A new AI agent Eureka developed by NVIDIA Research that can teach robots complex skills has trained a robotic hand to perform rapid pen-spinning tricks — for the first time as well as a human can.[1] Meta’s Habitat 3.0 simulates real-world environments for intelligent AI robot training.[2] South Korea’s SK telecom Co. will collaborate with Deutsche Telekom AG to jointly develop a telecommunications-specific artificial intelligence (AI) large language model (LLM) as competition intensifies among local telecom companies to expand overseas with their own AI capabilities.[3] Scientists say they have built an artificial intelligence (AI) tool that can successfully identify and confirm supernovas.[4] Sources: [1] https://blogs.nvidia.com/blog/2023/10/20/eureka-robotics-research/ [2] https://siliconangle.com/2023/10/20/metas-habitat-3-0-simulates-real-world-environments-intelligent-ai-robot-training/ [3] https://pulsenews.co.kr/view.php?year=2023&no=810112 [4] https://learningenglish.voanews.com/a/researchers-build-first-tool-to-discover-supernovas/7318435.html submitted by /u/Excellent-Target-847 [link] [comments]
    How To Earn $1M+ By Using AI To Write Books
    I've been using ai for a long time, it often helps me to reduce my work time, but I want to try to earn money and decided to make an investigation. I want to hear your opinion on my analysis, and maybe this post will help someone in starting a business through ai Joe Popelas, a very young entrepreneur, has made over a million dollars within the last year selling AI-generated books online. I literally got fascinated by how simple yet powerful it is with these tools to create a book within a matter of a few hours. Joe Popelas is one of a new breed of AI entrepreneurs who capitalized on the democratization of large language models. Joe's story demonstrates the power of combining human creativity with AI. While AI tools did the heavy lifting for his initial drafts, Joe spent time refining …
  • Open

    From text to dream job: Building an NLP-based job recommender at Talent.com with Amazon SageMaker
    This post is co-authored by Anatoly Khomenko, Machine Learning Engineer, and Abdenour Bezzouh, Chief Technology Officer at Talent.com. Founded in 2011, Talent.com is one of the world’s largest sources of employment. The company combines paid job listings from their clients with public job listings into a single searchable platform. With over 30 million jobs listed […]  ( 12 min )
  • Open

    Street View to the Rescue: Deep Learning Paves the Way to Safer Buildings
    Images such as those in Google Street View are taking on a new purpose in the hands of University of Florida Assistant Professor of Artificial Intelligence Chaofeng Wang. He’s using them, along with deep learning, in a research project to automate the evaluation of urban buildings. The project aims to help governments mitigate natural disaster Read article >  ( 6 min )
  • Open

    How to properly evaluate competitive MARL?
    Hello, everyone! I'm building a MARL agent for a zero-sum game and I'm having a hard time evaluating it. I managed to quickly train it for a simple case and I could manually verify that it was actually learning the optimal decision making because I already know how the game works and, for this simple case, I know that there actually is a mathematically correct way to play it (from both sides) and how it should be played, but that isn't true for most cases (and even if it was, I wouldn't be able to manually verify thousands of games). To complicate things even more, there are billions and billions of possible initial states. For single-agent RL, I could set a reward threshold (if I knew which was the maximum reward possible) or at least I could set a maximum time of "no improvement" but, in a zero-sum game, the sum of the policy rewards is, well, zero. I could think of two solutions: Evaluate convergence to Nash Equilibrium on a subset of the possible initial states, which could be a problem because I'm not sure if the game dynamics guarantee the existance of Nash Equilibria; Evaluate convergence of the winrate of the trained agent against a "hand-crafted" baseline agent, which could be a problem because the quality of this evaluation method could depend on how well I can make this baseline agent (which won't be even close to optimal, otherwise I wouldn't be training an agent). Any thoughts? submitted by /u/victorsevero [link] [comments]
    Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation
    submitted by /u/gwern [link] [comments]
    In your opinion, which is the most beautiful form of the Bellman Equation and why?
    Didn't see anything about this kind of post in the rules I'm asking for a tattoo idea haha submitted by /u/victorsevero [link] [comments]
    Inverted pendulum swing-up problem not converging to global optimum using SAC or TD3.
    I am making a thesis about using RL to solve the inverted pendulum swing-up problem. I have tried using TD3, SAC, and TD3-Fork. In my testing, TD3-Fork worked best, I think SAC would also work if I am able to tune the hyperparameters correctly. I would like a similar trained agent to td3 converged where the agent balances the pole almost indefinitely. I have tried the hyperparameters from the website and also different hyperparameters but it has not converged. I am wondering if I am missing something or if there is anything I can do to improve the agent. I have been thinking of using HER instead of FORK. Any help or advice would be appreciated. training reward data The 'maximum' reward that I could get in the simulation is >880. The reward function that I used is -[cos(theta) + 10(|x| > 0.9) + 10(|theta_dt| > 18)]. However, from the data above it only converges to about 837 max and rarely reaches >900. trained td3 fork agent submitted by /u/YEEETTT0708 [link] [comments]
    Godot enables me to do pure C# Deep reinforcement learning.
    submitted by /u/Vae94 [link] [comments]
    [R] Demo of “Flow-Lenia: Towards open-ended evolution in cellular automata through mass conservation and parameter localization” (link to paper in the comments)
    submitted by /u/gwern [link] [comments]
  • Open

    Celebrating Kendall Square’s past and shaping its future
    The 15th Kendall Square Association annual meeting explored new and old aspects of the neighborhood.  ( 9 min )
  • Open

    Nonlinear algebra
    What is nonlinear algebra? Negations are tricky. They may be the largest source of bugs in database queries. You have to carefully think about what exactly are you negating. Any time you see “non-” attached to something, you have to ask what the context is in which the negation takes place. For example, if you […] Nonlinear algebra first appeared on John D. Cook.  ( 6 min )
  • Open

    Are Generalized Self-Supervised ViT Models the Image Objective Counterpart of LLM’s?
    submitted by /u/No-Platypus4021 [link] [comments]
    Neural Networks: A Deep Dive into AI's Building Blocks
    submitted by /u/Emily-joe [link] [comments]
  • Open

    Abstracts: October 23, 2023
    Today on “Abstracts,” Partner Research Manager Andy Gordon & Senior Researcher Carina Negreanu explore new work introducing co-audit, a term for any tool-assisted experience that helps users of generative AI find and fix mistakes in AI output. The post Abstracts: October 23, 2023 appeared first on Microsoft Research.  ( 16 min )

  • Open

    [P] Having GPT-4 Iterate on Unit Tests like a Human
    Hi r/MachineLearning, My name is William and I’m one of the founders of Sweep. Sweep is an AI junior developer that writes and fixes code by mirroring how a developer works. While building Sweep, we used to use the Github API, but we ran into rate limits, so we changed this to clone your repository for the duration of the request. It's now coming full circle. Sweep can now write, run, and debug a failing unit test for the ClonedRepo class! Blog: https://docs.sweep.dev/blogs/ai-unit-tests Video: https://www.youtube.com/watch?v=N9PUxmja9z4 submitted by /u/williamsweep [link] [comments]  ( 9 min )
    [D] Structured learning resources for ML Theory
    So essentially what the title says. I want to truly understand whats happening behind Machine Learning in general and also behind each algorithm specifically (starting from the basics to more advanced things, like Logistic Regression, Decisions trees and random forests, Deep Learning, NLP, GANS...). By structured I mean it contains all the pieces ordered and organized, from the same source, so you can can actually go from the building blocks up, not just a YouTune channel that uploads interesting videos about different machine learning related topics. Regarding the medium, I don't really mind but I would prefer audiovisual content (YT channel/playlists, Lectures, conferences...) but if you really recommend a specific book or series of books that's also okay. If it has some practical focus to it (to better grasp the theory) that would great. Also, I would prefer if it goes deep into the details, but not too deep into the specific maths involved, but if it's the case thats also okay. Regarding price, obviously if it's free that would be awesome, but in the range of free to 40€ is fine. Thank you for your recommendations in advance!! submitted by /u/aleradamantis [link] [comments]  ( 9 min )
    URL PHISHING OR BENIGN USING DEEP LEARNING "[Research]", "[R]", "[Project]", "[P]"
    Guys does anyone have an idea why my model does not work and it's like 50-50 chance to get it right. I'm getting really frustrated. Here is the code so far: ​ import pandas as pd import torch import torch.nn as nn import torch.optim as optim from torch.utils.data import Dataset, DataLoader from collections import Counter from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score import re from imblearn.under_sampling import RandomUnderSampler # Loading the data file_path = "C:/Users/alex/Desktop/DATASET/malicious_phish.csv" data = pd.read_csv(file_path) # Filtering data filtered_data = data[data['type'].isin(['phishing', 'benign'])] # Undersampling the majority class rus = RandomUnderSampler(rand…  ( 12 min )
    [D] - Pre-Training a 4bit model (NOT Fine-tunning)
    Pre-Training using 4bit (NOT fine-tunning) Hello community! I have been messing around with open source LLM's running them locally using peft and AutoGPTQ in Transformers. I even trained a few QLora models (my favorite part) However my question is this, given the performance of a 4bit model why hasn't there been any research in this area? Is it possible to even create a new model using 4bit altogether? I am sure it's not as easy as it sounds but I haven't seen anyone try. Just curious cause it will open doors for many of us with consumer grade hardware. Thanks! submitted by /u/Delicious-Farmer-234 [link] [comments]  ( 9 min )
    [P] Infinity, a FOSS project for supporting RAG for LLMs and Vector Embeddings.
    https://github.com/michaelfeil/infinity Infinity, a open source REST API for serving vector embeddings, using a torch / ctranslate2 backend. Its under MIT License, fully tested and available under GitHub. I am the main author, curious to get your feedback. FYI: Huggingface launched a couple of days after me a similar project ("text-embeddings-inference"), under a non open-source and non-commercial license. submitted by /u/OrganicMesh [link] [comments]  ( 9 min )
    [R] Combining Thermodynamics and Diffusion Models for Collision-Free Robot Motion Planning
    Researchers from Yonsei University and UC Berkeley recently developed a new AI method for enabling autonomous robots to navigate unfamiliar environments filled with obstacles using only visual data as input. The key innovation is a customized diffusion model. Diffusion models can generate diverse motion plans by adding controlled noise. The researchers tailored the model to mimic how heat avoids insulation when dispersing through space. Similar to heat navigating around insulators, this "collision-avoiding" diffusion model learns to predict robot motions that avoid collisions with obstacles. It generates reachable goals and viable motion plans to those goals simultaneously. In simulations, this approach achieved ~98% success rates in navigating to target destinations while avoiding randomly generated obstacles using only visual map images as input. While extensive real-world testing is still needed (only 2D, only simulation), these initial results showcase promising capabilities: Enables navigation in unfamiliar environments without pre-mapping. Flexibly identifies and progresses toward reachable goals. Avoids unnecessary sensing systems for obstacle avoidance. Learns complex collision avoidance heuristics from visual data. I like the thermo + AI + robotics combination here - takes me back to my days in aerospace engineering. Pretty interesting approach. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [R] Speeding up open source LLMs with speculative decoding
    submitted by /u/firef1y1 [link] [comments]  ( 9 min )
    [P] Open Source AI repos that caught my 👀 this week
    @MetaGPT_ github.com/geekan/MetaGPT - multi agent collaboration - MetaGPT encodes Standard Operating Procedures (SOPs) into prompts. The claim is that it takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc. @Ollama_ai github.com/jmorganca/olla… - run large language models locally. The future of AI/LLMs may not be on the cloud, but on your own laptops/mobiles. ollama.ai/blog/building-… @huggingface github.com/huggingface/ca… - slick ML framework for Rust with a focus on performance (including GPU support) @remilouf github.com/outlines-dev/o… - helps developers guide text generation to build robust interfaces with external systems. Provides generation methods that guarantee that the output will match a regular expressions, or follow a JSON schema. github.com/YiVal/YiVal enterprise AI platform submitted by /u/oana77oo [link] [comments]  ( 9 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 9 min )
    [D] Teachers struggle to adapt amid AI revolution in education
    submitted by /u/DutchTechJunkie [link] [comments]  ( 9 min )
    [R] Language interaction to assist in composing and refining music
    Hey guys, I found an interesting paper recently. Universities in the UK introduced Loop Copilot, enabling users to generate and iteratively refine music through an interactive, multi-round dialogue interface. Using language interaction to assist in composing music is very appealing, AI makes a complex workflow easy and automated. https://preview.redd.it/nzvlgevcarvb1.jpg?width=998&format=pjpg&auto=webp&s=815e0f48c299831700215ebdb4257423e317f5ec submitted by /u/xuying_li [link] [comments]  ( 9 min )
    [D] How to account for extreme periods in time series forecasting?
    I am performing a (machine learning) time series forecast on monthly data from the last 20 years. If I separate my data into a train, validation, and test set, the validation set is almost completely filled with extreme values due to the Covid period. How to account for this? submitted by /u/Ambitious-Pay6329 [link] [comments]  ( 9 min )
    [P] Graphing emotion events with LMs for in-depth sentiment analysis
    submitted by /u/helliun [link] [comments]  ( 8 min )
    [D] DINOv2 Breakdown: I've Created a Visual Guide to the Model's Design & a Concise Code Walkthrough
    submitted by /u/CkmCpvis [link] [comments]  ( 9 min )
    [R] Do you read ML/DL/AI related scientific papers? How do you filter them?
    As the title says. Recently, I found a review paper where the authors showed an exponential growth of published papers related to ML or DL. I was wondering if you even read those. If yes what's your way to find good and reliable papers? Do you choose only ones with a significant number of citations? Or just strictly related to your field? If no, why not? https://preview.redd.it/jwjvej5f5qvb1.jpg?width=1080&format=pjpg&auto=webp&s=bf3f7e08e0fe09fe0c6a6fd8d194945b45f5858e submitted by /u/hahahaczyk [link] [comments]  ( 9 min )
    [R] Open-Source Projects on Detecting Landmines
    I know that there are a lot of efforts at the moment to improve the algorithms used for landmine detection. Is anyone aware of any ongoing open-source projects in this space? submitted by /u/Eightstream [link] [comments]  ( 9 min )
    [R] Demo of “Flow-Lenia: Towards open-ended evolution in cellular automata through mass conservation and parameter localization” (link to paper in the comments)
    submitted by /u/hardmaru [link] [comments]  ( 9 min )
    German researchers create DeepMB for faster, high-quality optoacoustic imaging [N]
    Researchers from Germany have developed DeepMB, a groundbreaking deep-learning framework enabling high-quality and real-time optoacoustic imaging via multispectral optoacoustic tomography (MSOT). With potentially transformative implications for health care, this innovation might redefine medical imaging standards. To stay ahead of developments in AI, look here first. DeepMB breakthrough DeepMB resolves the longstanding tradeoff between image quality and speed in medical imaging. The deep-learning framework uses a deep neural network for model-based reconstruction, allowing for fast, high-quality imaging. DeepMB can reconstruct images approximately 1000 times faster than conventional techniques, with virtually no loss in image quality. Impressive metrics and implications The researchers accomplished accurate optoacoustic image reconstruction in just 31 milliseconds per image by training the system to pairingly synthesize optoacoustic signals with ground-truth images. DeepMB promises to equip clinicians with immediate access to high-quality MSOT images, regardless of the patient's condition or scanned body area. The technology could extend to other imaging modalities, such as ultrasound, x-ray, and MRI, potentially changing how diseases are diagnosed and treated. Exciting prospects The development of DeepMB is a significant leap in optoacoustic imaging, promising to enhance healthcare outcomes. As DeepMB evolves, it could become integral to modern medical imaging, delivering high-quality results at previously unattainable speeds. (source) P.S. If you like this kind of analysis, I write a free newsletter that unpacks the most significant news and research in AI. Google, Meta, and OpenAI professionals are already subscribed submitted by /u/orthomax23 [link] [comments]  ( 9 min )
    [D] ForeCastNet. Neural PDEs perform global weather simulation 4 to 5 orders of magnitude faster than traditional numerical methods.
    submitted by /u/moschles [link] [comments]  ( 9 min )
    Data labeling service for keypoints / pose [D]
    I was previously using scale.ai but they have been extraordinarily slow. Does anyone have recommendations for services to label keypoints or pose? Bonus points if the labeling service is able to handle 3D / multi angle data coming from multiple cameras. I work in an academic lab and scale is <10k images per batch. submitted by /u/researchrig [link] [comments]  ( 9 min )
    [D] Need help with text-to-song diffusion model architecture
    Hey, I want to make a text-to-song diff model, but I can't figure out the architecture I have already prepared a dataset of about 5000 songs of different genres, artists. It only contains the lyrics, the genre and the song itself Do I understand correctly that I should just encode the text and genre into one vector using CLIP and hope that the model will directly follow it (not skipping words and lines), or should I somehow make timestamps in the dataset (when, where and what text is sung)? I was inspired by Chirp V1 submitted by /u/Head-Selection-9785 [link] [comments]  ( 9 min )
  • Open

    IBM's NorthPole chip runs AI image recognition 22x faster than current chips
    IBM has developed a chip called NorthPole that runs AI-based image recognition 22 times faster than current chips on the market. The chip uses a two-dimensional array of memory blocks and interconnected CPUs to process data quickly. However, it can only run specialized AI processes and not training processes or large language models. The researchers plan to test connecting multiple NorthPole chips together to overcome this limitation. Source : https://techxplore.com/news/2023-10-ibm-northpole-chip-ai-based-image.html submitted by /u/NuseAI [link] [comments]
    Email Ai
    is there a website or some Ai to help me clean my inbox, stop receiving emails from certain senders etc etc... I've heard about: Sanebox for keeping your inbox organized Mailbutler for gathering contact details and tasks EmailTree for creating AI-powered workflows But they are paid and I'm looking for free alternatives submitted by /u/JOTA-137_0 [link] [comments]
    Microsoft CEO Satya Nadella talks AI, closing the Activision Blizzard deal, and his best business decision so far
    submitted by /u/thisisinsider [link] [comments]
    Medical Student Question: Why aren't there any programs that do differential diagnosis for doctor?
    Based on input you have. This would be like an enterprise software level program I guess and you would input history and then through trawling through data locally it can generate diseases and probability patient has each disease based on data inputted Why doesn't something like this already exist? I am learning how to do differential diagnosis now and it seems use extremely rudimentary understanding of probability to diagnose things. You use clusters of symptoms and then use tests to eliminate stuff in the differential. It just seems like low hanging fruit that a program could do using tech we already have (I imagine LLMs will make it easier) submitted by /u/derpgod123 [link] [comments]
    Tried visualizing an entire script using Dall-E 3 and these are the results.
    https://preview.redd.it/vi9wx005ksvb1.jpg?width=1024&format=pjpg&auto=webp&s=75502abcae7f2337693175101cb3491b8647d70d Revived an old script and made some images for it using Dall-E 3, just to test out the workflow: https://docs.google.com/document/d/1yyWRRmd0ah5Z4u8_aNYSq9csJ8pccP24Dcs9brPHbzs/edit Was pretty fun and I think by the end I got much better at learning how to maintain the consistency between characters, direct shots, etc. -~- submitted by /u/Kulimar [link] [comments]
    Combing Thermodynamics and Diffusion Models for Collision-Free Robot Motion Planning
    Researchers from Yonsei University and UC Berkeley recently developed a new AI method for enabling autonomous robots to navigate unfamiliar environments filled with obstacles using only visual data as input. The key innovation is a customized diffusion model. Diffusion models can generate diverse motion plans by adding controlled noise. The researchers tailored the model to mimic how heat avoids insulation when dispersing through space. Similar to heat navigating around insulators, this "collision-avoiding" diffusion model learns to predict robot motions that avoid collisions with obstacles. It generates reachable goals and viable motion plans to those goals simultaneously. In simulations, this approach achieved ~98% success rates in navigating to target destinations while avoiding randomly generated obstacles using only visual map images as input. While extensive real-world testing is still needed (only 2D, only simulation), these initial results showcase promising capabilities: Enables navigation in unfamiliar environments without pre-mapping. Flexibly identifies and progresses toward reachable goals. Avoids unnecessary sensing systems for obstacle avoidance. Learns complex collision avoidance heuristics from visual data. I like the thermo + AI + robotics combination here - takes me back to my days in aerospace engineering. Pretty interesting approach. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]
    I upgraded my AI girlfriend… and now she remembers stuff about me..
    submitted by /u/spaceecon [link] [comments]
    Self-learning AI Movement Prediction: Beyond Airstriker Genesis to multi-directional predictions
    Quick update on my self-learning software experiment: Thanks to your feedback, I decided to test my prediction system on a newer tower-defense game from the Apple App Store (simply called ‘The Tower’). What's crucial to remember is that this system is not pre-trained and only learns from the current game it encounters - it starts with zero knowledge and learns exclusively from the game it's currently playing, building from the ground up without the use of deep learning or neural networks. In this game (unlike Airstriker which I’ve previously used), players don't control a spaceship or fire weapons (you play the game by ‘upgrading’ your weapons, etc.). It's simpler because there's only one type of enemy that always approaches the center, so the system cannot demonstrate its capabilities for differentiation in this case. But this simplicity presents some other interesting challenges: Enemies approach from all 360-degree directions, pushing the boundaries of the path prediction software. They overlap during explosions, demanding the system to separate them. There's also more visual clutter, including static lines and a non-black background. The system's predictive performance has been remarkably strong. I’ve put together an overlay video to visually demonstrate how the system learns and adapts in this new game. Note: If things don’t align perfectly in there, it’s due to my poor video editing skills… Your feedback is appreciated as always! submitted by /u/_timmah_ [link] [comments]
    A scary thought...
    Without us, artificial intelligence just becomes intelligence submitted by /u/cognaceast [link] [comments]
    AI RPG DALL-E 3
    submitted by /u/the_anonymizer [link] [comments]
    Could machine learning produce a "simple" AI algorithm that performs better than what a human programmer could create in a reasonable amount of time?
    Let me clarify what I'm asking through an example: Artificial Intelligence in videogames has failed to develop in any meaningful way over the past two decades, at least as far as the typical end-user is concerned, and nowhere is this more apparent than in strategy games. Whether we're talking about the 90's or today, AI opponents typically have to receive significant cheats in order to provide a challenging experience for the player. This is widely considered undesirable, can harm immersion or a sense of fair-play, and leads to the concept of "cheesing" the AI (exploiting obvious weaknesses in the AI logic, something which is sometimes necessary if an AI receives such strong bonuses that any strategy you might attempt against another human player would be impossible to execute successfull…
  • Open

    Stability of a superegg
    Three weeks ago I wrote about supereggs, a shape popularized by Piet Hein. One aspect of supereggs that I did not address is their stability. Looking at the photo above, you could imagine that if you gave the object a slight nudge it would not fall over. Your intuition would be right: supereggs are stable. […] Stability of a superegg first appeared on John D. Cook.  ( 6 min )
    Best-of-five versus Best-of-seven
    Suppose that when Team X and Team Y play, the probability that X will win a single game is p and the probability that Y will win is q = 1 − p. What is the probability that X will win the majority of a series of N games for some odd number N? We know intuitively […] Best-of-five versus Best-of-seven first appeared on John D. Cook.  ( 5 min )
  • Open

    How the Self Play algorithm masters Multi-Agent AI
    submitted by /u/AvvYaa [link] [comments]
    Mujoco RL Robotic Arm
    Hi everyone, I'm new to robotic arms and I want to learn more about how to implement them using mujoco env. I'm looking for some open-source projects on github that I can run and understand. I tried MuJoCo_RL_UR5 repo but it didn't work well for me, it only deployed a random agent. Do you have any recommendations for good repos that are beginner-friendly and well-documented? submitted by /u/satyamstar [link] [comments]
    Why does bellman equation converge?
    After multiple iterations the value function converge by bellaman updates (vale iteration algorithm). Can someone provide a intuitive reasoning why the value converges? submitted by /u/RaceCondition01 [link] [comments]
  • Open

    Replay game input with image classification
    TensorFlow Keras correcting camera horizon in AC Valhalla https://www.youtube.com/watch?v=ASy-2zOMj_Y submitted by /u/Kostiantyn-Dvornik [link] [comments]
    How I determine neuron layers and amount of neurons in?
    Hello, I’m newbie in neural networks and I wonder, how do people decide how many hidden layers there will be and how many neurons will be inside? What the logic behind? submitted by /u/Particular-Song-633 [link] [comments]
    Unboxing Neuro Symbolic Reasoning and Learning
    submitted by /u/Neurosymbolic [link] [comments]

  • Open

    Google, other search engines' use of generative AI threatens $68B SEO industry
    The rise of generative AI in search engines like Google threatens the $68 billion search engine optimization (SEO) industry. Generative AI tools like ChatGPT aim to provide direct answers to user queries, bypassing the need for users to click on search results. This could render SEO efforts useless and impact the revenues of SEO consultants and search engines. However, generative AI search engines still face challenges such as providing incorrect or plagiarized answers, and gaining user trust and loyalty. Search engines have been quick to experiment with generative AI to improve search results, with Google's Bard, Microsoft's Bing AI, Baidu's ERNIE, and DuckDuckGo's DuckAssist being examples of this approach. As the quality of AI-generated answers improves, users will have less incentive to browse through search result listings, impacting the revenues of SEO consultants and search engines. The SEO industry generated $68.1 billion globally in 2022 and was expected to reach $129.6 billion by 2030, but the emergence of generative AI puts the industry at risk of obsolescence. Generative AI search engines are still in their infancy and face challenges such as providing incorrect or plagiarized answers, limiting their trust and loyalty among users. However, with the resources available to researchers, it is safe to assume that generative AI models will improve over time, leading to the potential death of the SEO industry. Source : https://theconversation.com/why-google-bing-and-other-search-engines-embrace-of-generative-ai-threatens-68-billion-seo-industry-210243 submitted by /u/NuseAI [link] [comments]
    Experimented with Fully Automating TikTok Video Creation Using AI for a Month - Here's What I Learned
    Hi everyone, I recently undertook a personal project where I tried to automate the entire process of creating TikTok videos using various AI tools. The goal was to see how advanced we've come in terms of AI's capabilities in content creation and to explore the nuances of automating a traditionally 'human' task. Here's a brief breakdown: Scripting: Leveraged ChatGPT for generating video scripts. Voiceovers: Used ElevenLabs for lifelike voice narration. Video Creation: Employed a combination of StableDiffusion Animate & Replicate. Editing: Automated the editing process to sync with the AI-generated voiceovers. After setting everything up, I ran the system for a month, generating 3 videos daily. The results were intriguing and a mix of expected and unexpected outcomes. Would love to hear thoughts, feedback, or similar experiences from the community. Are there other creative ways you've seen or used AI in content creation? submitted by /u/General_crypto [link] [comments]
    AI RPG (Dall-E 3)
    submitted by /u/the_anonymizer [link] [comments]
    Thanks to AI, the future of programming may involve YELLING IN ALL CAPS
    The future of programming may involve human-like communication techniques, including yelling in all caps. OpenAI's DALL-E 3 AI image generator integrated into ChatGPT revealed internal prompts shared between the image generator and the AI assistant. The prompts included commands written in all-caps for emphasis. This shows that programming and communicating with computers may become more human-like in the future. Previously, programs used specialized data formats and APIs to communicate, but now large language models allow for cross-program interaction in conventional English. OpenAI trained GPT-4, the AI model used in ChatGPT DALL-E interface, on hundreds of millions of documents scraped from the web, which included instances of polite language and reactions to it. The use of all-caps in the DALL-E message is interpreted as emphasis, and the model pays more attention to capitalized sentences. In the future, programming and communicating with computers may involve more emphasis and human-like communication techniques. Source : https://arstechnica.com/information-technology/2023/10/thanks-to-ai-the-future-of-programming-may-involve-yelling-in-all-caps/ submitted by /u/NuseAI [link] [comments]
    Close up view of rain hitting dust.
    submitted by /u/IllustriousVideo6145 [link] [comments]
    Impressive
    submitted by /u/the_anonymizer [link] [comments]
  • Open

    Policy Evaluation
    I know that given a policy, I can find the value function using iterative policy evaluation. Can I, given the value function, find the policy? submitted by /u/MomoSolar [link] [comments]
    Question on advantage (re-)computation for PPO
    Hi, I've been re-reading the "What matters in on-policy reinforcement learning" paper (https://arxiv.org/abs/2006.05990), and noticed that they suggest to recompute advantages at the beginning of each epoch (choice C5, see section 3.5 and appendix B.1). I was wondering: if someone here had already tried this and seen a significant improvement (which is what the paper suggests) ? if it did not also suppose to recompute the value targets at the beginning of each epoch, which could lead to some sort of moving target issue ? Best, submitted by /u/Scrimbibete [link] [comments]
    In RL, how can we reward an action taken 5 steps ago?
    Let us say we are building a model that will learn how to play a computer game like DOTA or league of legends. If model for example, buys weapon A, and use the item's ability on opponent B, it should learn what damage it gives to opponent given the items opponent B is wearing. But we would have done a lot of other actions in between before being able to use that weapon to reward the model on what it does / how much damage it made. How does do you do delayed reward for specific action made X number of steps ago? Thank you. submitted by /u/oniongarlic88 [link] [comments]
  • Open

    [D] PRML reading buddy
    Hey there mates, I am a 3rd year PhD student, trying to break into good quality research (tired of trying different permutations ans combinations of X and Ys, and hitting dead end when things don't work, or worse -- being unable to explain why things work :D). I have recently decided to read PRML cover to cover (slowly) and do some of the exercises as well. Goal is to finish in 6 months (2 chapters per month). Is there anyone on a similar journey, would love to tag along and discuss nuances? submitted by /u/Zealousideal_Yak9131 [link] [comments]  ( 9 min )
    [D] What do you all think of these pearls of wisdom on “Doing Great Research”?
    About the latest Jason Wei’s tweet. submitted by /u/mildlyphd [link] [comments]  ( 9 min )
    [D] Which is the best physics engine for reinforcement learning??
    What are some of the best physics engine that we should be using to implement physics for complex reinforcement related tasks(like humanoid motions) ?? I came across mujoco, physx , pybullet, issac etc but not sure which to go with. Isaac seems to be something very interesting but the minimun requirements as per the website is 32gb of RAM which is way to much for me (I use a 8gb one). mujoco is good but the docs are very confusing and hard to get through. what do you believe is the best choice to go with?? submitted by /u/rakk109 [link] [comments]  ( 9 min )
    [D] Ensemble of Strong vs Weak Predictors
    This crossed my mind recently and after searching online I couldn't find a concrete answer: would an ensemble composed of strong predictors (let's say training on 1 model of that type had a high metric performance) perform better than an ensemble composed of weak predictors? Bonus: are there any resources that would support your position you can link below? submitted by /u/robml [link] [comments]  ( 9 min )
    [R] Decoupling Features and Classes with Self-Organizing Class Embeddings
    submitted by /u/4rtemi5 [link] [comments]  ( 9 min )
    [P] [D] Hierarchical agent learns all possible policies. Would this implementation work?
    Here's my implementation of an idea I had many years ago: a Sensorimotor Inference Engine. A machine that explores the states space of an environment, learning how to traverse the state space, learning how to manipulate the environment, which when given a goal can manipulate the environment in accordance to the goal. In other words, it's an agent which learns not one policy, but all possible policies. Doing so, I believe requires a hierarchy: layers of the same structure which learn broader and broader contexts of the environment. I have recently attempted to design an extremely simple, and modularized version of this agent: The Encoder-Predictor-Actor circuit. I need feedback, do you think it would work? if it might work, how might I train the Actor model? I think I know how to train the Encoder and Predictor models, but the Actor model will be harder to train, so if you have any ideas I'd love to hear from you! ps. sorry for the typos in the image text. a first-pass diagram of the 'simplest' implementation of a sensorimotor inference engine: the encoder-predictor-actor circuit submitted by /u/Stack3 [link] [comments]  ( 9 min )
    Career suggestions [D]
    Hi there, I need some suggestions from you experts. I am an aerospace engineer (both BSc and MSc), with a university minor in AI. It's pretty clear to me that I should have studied computer science given my passion for this world. In the last 4 years I worked as engineer in a major aerospace company, and I managed to get back on track with computer science and ML by working as a data scientist and doing ML projects applied to space, while also practicing with LLM agents. My dream is to enter the AGI world, maybe working as an "AI engineer", or working on creating true "autonomous" systems, leveraging multi-modal models maybe. What do you suggest I should focus on to reach this goal? Getting first some "credit" as an ML engineer though courses and certifications, open source projects, or maybe applying right now to some startups in the field? Thanks guys! submitted by /u/cappellino1 [link] [comments]  ( 9 min )
    A[r]xiv Dives - Generating Speech from Text with Fast Speech-2
    We’ve been diving deep into Arxiv Papers as a team on Fridays, hope it’s helpful and feel free to join live if you like the format! submitted by /u/FallMindless3563 [link] [comments]  ( 9 min )
    [R] Eureka: Human-Level Reward Design via Coding Large Language Models
    submitted by /u/MysteryInc152 [link] [comments]  ( 8 min )
    Computer Vision Project Ideas [Project]
    I am taking the computer vision course at my university. We have to do a final project but I am unable to come up with concrete ideas. These are the options: • Select a paper from the computer vision literature, implement and test the approach described in that paper • Take publicly available code, apply it to an interesting novel dataset and explore various extensions and modifications. You may also want to compare two or more systems. Running existing code on the data provided by the authors is not sufficient. • Design and implement a solution to a problem that interests you. This may earn you extra credits. Can anyone please help with what to do? submitted by /u/kxenak [link] [comments]  ( 9 min )
    [D] Can you use a different dataset to run ablation experiments?
    I am on a computer vision algorithm and I will be benchmarking my method on the MS COCO dataset, like the other methods that have been proposed for the same problem. I want to know if I can use a smaller dataset (COCO minitrain) for my ablation experiments to demonstrate the efficacy of the different components used in my algorithm and to save time and cost, or will that be a red flag to journal reviewers? submitted by /u/notEVOLVED [link] [comments]  ( 9 min )
    Searching pinecone for relative date information [R]
    I am embedding with gpt and upserting large medical reports into pinecone and then would like to query for chronological result. For example, I upload a report that consists of 10 office visits. I would like to know the date and results of the first visit and then the last visit. when I embed a query containing: How did the patient describe their pain in the last office visit in the text? pinecone doesn't understand the context of 'last' since it is just doing cosine likeness. It pulls pain information but doesn't have a clue which comes first. Any help would be greatly appreciated. submitted by /u/Silent_Case_3058 [link] [comments]  ( 9 min )
    [P] Wizard101 Auto-Buyer Script/Bot - Using OCR, OpenCV Python with multiprocessor performance improvements
    submitted by /u/HistorianCrafty3514 [link] [comments]  ( 8 min )
    [P] [D] : RAG on multilevel tabular data
    Hi, Has anyone done RAG on a multi level tabular data? If yes then what problems have you faced and how did you solve those? My model gives better answers when I converted the data to a JSON and then embedded it. But I'm looking for a better approach. submitted by /u/Euphoric-Chart1428 [link] [comments]  ( 9 min )

  • Open

    Implementing Gradient Descent in PyTorch
    The gradient descent algorithm is one of the most popular techniques for training deep neural networks. It has many applications in fields such as computer vision, speech recognition, and natural language processing. While the idea of gradient descent has been around for decades, it’s only recently that it’s been applied to applications related to deep […] The post Implementing Gradient Descent in PyTorch appeared first on MachineLearningMastery.com.  ( 25 min )

  • Open

    Training a Linear Regression Model in PyTorch
    Linear regression is a simple yet powerful technique for predicting the values of variables based on other variables. It is often used for modeling relationships between two or more continuous variables, such as the relationship between income and age, or the relationship between weight and height. Likewise, linear regression can be used to predict continuous […] The post Training a Linear Regression Model in PyTorch appeared first on MachineLearningMastery.com.  ( 24 min )
    Making Linear Predictions in PyTorch
    Linear regression is a statistical technique for estimating the relationship between two variables. A simple example of linear regression is to predict the height of someone based on the square root of the person’s weight (that’s what BMI is based on). To do this, we need to find the slope and intercept of the line. […] The post Making Linear Predictions in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Loading and Providing Datasets in PyTorch
    Structuring the data pipeline in a way that it can be effortlessly linked to your deep learning model is an important aspect of any deep learning-based system. PyTorch packs everything to do just that. While in the previous tutorial, we used simple datasets, we’ll need to work with larger datasets in real world scenarios in […] The post Loading and Providing Datasets in PyTorch appeared first on MachineLearningMastery.com.  ( 20 min )

  • Open

    Using Dataset Classes in PyTorch
    In machine learning and deep learning problems, a lot of effort goes into preparing the data. Data is usually messy and needs to be preprocessed before it can be used for training a model. If the data is not prepared correctly, the model won’t be able to generalize well. Some of the common steps required […] The post Using Dataset Classes in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Calculating Derivatives in PyTorch
    Derivatives are one of the most fundamental concepts in calculus. They describe how changes in the variable inputs affect the function outputs. The objective of this article is to provide a high-level introduction to calculating derivatives in PyTorch for those who are new to the framework. PyTorch offers a convenient way to calculate derivatives for […] The post Calculating Derivatives in PyTorch appeared first on Machine Learning Mastery.  ( 20 min )

  • Open

    Two-Dimensional Tensors in Pytorch
    Two-dimensional tensors are analogous to two-dimensional metrics. Like a two-dimensional metric, a two-dimensional tensor also has $n$ number of rows and columns. Let’s take a gray-scale image as an example, which is a two-dimensional matrix of numeric values, commonly known as pixels. Ranging from ‘0’ to ‘255’, each number represents a pixel intensity value. Here, […] The post Two-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 21 min )

  • Open

    One-Dimensional Tensors in Pytorch
    PyTorch is an open-source deep learning framework based on Python language. It allows you to build, train, and deploy deep learning models, offering a lot of versatility and efficiency. PyTorch is primarily focused on tensor operations while a tensor can be a number, matrix, or a multi-dimensional array. In this tutorial, we will perform some […] The post One-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 22 min )

  • Open

    365 Data Science courses free until November 21
    Sponsored Post   The unlimited access initiative presents a risk-free way to break into data science.     The online educational platform 365 Data Science launches the #21DaysFREE campaign and provides 100% free unlimited access to all content for three weeks. From November 1 to 21, you can take courses from renowned instructors and earn […] The post 365 Data Science courses free until November 21 appeared first on Machine Learning Mastery.  ( 15 min )

  • Open

    Attend the Data Science Symposium 2022, November 8 in Cincinnati
    Sponsored Post      Attend the Data Science Symposium 2022 on November 8 The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all day in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The […] The post Attend the Data Science Symposium 2022, November 8 in Cincinnati appeared first on Machine Learning Mastery.  ( 10 min )

  • Open

    My family's unlikely homeschooling journey
    My husband Jeremy and I never intended to homeschool, and yet we have now, unexpectedly, committed to homeschooling long-term. Prior to the pandemic, we both worked full-time in careers that we loved and found meaningful, and we sent our daughter to a full-day Montessori school. Although I struggled with significant health issues, I felt unbelievably lucky and fulfilled in both my family life and my professional life. The pandemic upended my careful balance. Every family is different, with different needs, circumstances, and constraints, and what works for one may not work for others. My intention here is primarily to share the journey of my own (very privileged) family. Our unplanned introduction to homeschooling For the first year of the pandemic, most schools in California, where …  ( 7 min )

  • Open

    The Jupyter+git problem is now solved
    Jupyter notebooks don’t work with git by default. With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks which provide clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment. To get started, follow the directions on Git-friendly Jupyter. Contents The Jupyter+git problem The solution The nbdev2 git merge driver The nbdev2 Jupyter save hook Background The result Postscript: other Jupyter+git tools ReviewNB An alternative solution: Jupytext nbdime The Jupyter+git problem Jupyter notebooks are a powerful tool for scientists, engineers, technical writers, students, teachers, and more. They provide an ideal notebook environment for interact…  ( 7 min )
2023-11-20T00:46:50.840Z osmosfeed 1.15.1